Machine Learning Systems and Methods for Using an Orthogonality Heuristic to Identify an Ignored Labeling Target

ABSTRACT

Among other things, systems and methods are disclosed that enable efficient assessment of the currently known manifolds within a problem space. A set of labeled vectors is identified as well as a set of unlabeled vectors. An angle-based comparison is made between each unlabeled vector and each labeled vector. If the smallest angle between a given unlabeled vector and any of the labeled vectors is deemed satisfactory, such as when the angle is small and acute, obtaining a label for that vector is deemed not crucial. However, if the smallest angle between a given unlabeled vector and any of the labeled vectors is deemed large, such as when the vector is orthogonal to the labeled set, then the given vector possesses information pivotal to learning the problem space. All such vectors are ranked, and the unlabeled vectors with the largest angles to the labeled set are sent to an oracle first in order to improve the labeled set of vectors.
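By way of non-limiting illustration, the angular ranking summarized above may be sketched in Python as follows. The function names `angle_between` and `rank_by_orthogonality` and the example vectors are hypothetical and are not taken from the disclosure; the sketch assumes non-zero vectors of equal dimension and ranks each unlabeled vector by its smallest angle to any labeled vector, largest first.

```python
import math


def angle_between(u, v):
    """Angle in radians between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)


def rank_by_orthogonality(unlabeled, labeled):
    """Indices of unlabeled vectors, ordered by largest smallest-angle first."""
    smallest = [min(angle_between(u, l) for l in labeled) for u in unlabeled]
    return sorted(range(len(unlabeled)), key=lambda i: smallest[i], reverse=True)


# Hypothetical two-dimensional example data.
labeled = [(1.0, 0.0), (1.0, 1.0)]
unlabeled = [(2.0, 0.1), (0.0, 3.0), (1.0, 2.0)]
ranking = rank_by_orthogonality(unlabeled, labeled)
```

In this example the vector (0.0, 3.0) is ranked first because its smallest angle to the labeled set (45 degrees) is the largest, so under the described heuristic it would be sent to the oracle before the other two.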

COPYRIGHT NOTICE

Contained herein is material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent disclosure by any person as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all rights to the copyright whatsoever. Copyright © 2021, Fortinet, Inc.

BACKGROUND

Field

Embodiments of the present disclosure generally relate to machine learning, and more particularly to systems and methods for identifying vectors for labeling for a learning model.

Description of the Related Art

Supervised learning involves manual labeling of large amounts of data to properly train machine learning models. Such an approach, while effective in developing helpful models, is often cost prohibitive. Active learning, on the other hand, seeks to reduce the number of labels needed to meaningfully train a model. Such active learning relies on actively selecting queries to direct labeling. While active learning can reduce the cost of labeling, there is no guarantee that it will work for a particular problem space under consideration. Indeed, research has shown that in some cases active labeling can actually require the labeling of more data than randomly applied data labeling.

Hence, there exists a need in the art for improved approaches for labeling.

SUMMARY

Embodiments of the present disclosure generally relate to machine learning, and more particularly to systems and methods for identifying vectors for labeling for a learning model.

This summary provides only a general outline of some embodiments. Many other objects, features, advantages and other embodiments will become more fully apparent from the following detailed description, the appended claims and the accompanying drawings and figures.

BRIEF DESCRIPTION OF THE DRAWINGS

In the figures, similar components and/or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label with a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description applies to any one of the similar components having the same first reference label irrespective of the second reference label.

FIGS. 1A-1B illustrate an example system in which a machine learning model training system may be deployed in accordance with some embodiments;

FIG. 2 is a flow diagram showing a method in accordance with various embodiments for machine learning model development;

FIGS. 3A-3E show an example of processing high dimensional data that may be used in relation to some embodiments;

FIG. 4 is a flow diagram showing a method for performing multiple process feature calculation on labeled input vectors in accordance with some embodiments;

FIG. 5 is a flow diagram showing a method in accordance with some embodiments for ranking unlabeled input vectors;

FIGS. 6A-6E show an example graphically depicting the vector ranking process of FIG. 5;

FIG. 7 shows an example VQNN that may be used to perform the vector ranking processes discussed in relation to FIG. 5;

FIG. 8 is a flow diagram showing a method for adaptive vector labeling in accordance with various embodiments;

FIG. 9 shows an example VPLNN that may be used to perform the vector labeling processes discussed in relation to FIG. 8;

FIG. 10 shows a DRU that may be used in relation to various embodiments;

FIG. 11 is a flow diagram showing a method in accordance with some embodiments for using perturbation to identify high value labeling targets;

FIG. 12 is a flow diagram showing another method in accordance with various embodiments for using perturbation to identify high value labeling targets; and

FIG. 13 is a flow diagram showing a method in accordance with some embodiments for using an orthogonality heuristic to identify high value labeling targets.

DETAILED DESCRIPTION

Embodiments of the present disclosure generally relate to machine learning, and more particularly to systems and methods for identifying vectors for labeling for a learning model.

It has been found that the issue with traditional active learning is that it focuses on a singular, model specific strategy. While this approach works for many problem spaces, each model specific strategy has limitations, such as the susceptibility of uncertainty sampling to choosing outliers and the tendency of query-by-committee approaches to focus on non-consequential regions of the problem space. Various embodiments set forth herein utilize multiple heuristics as part of identifying vectors for labeling.

Such systems and methods may be used in relation to a variety of problem spaces to train machine learning models that can be deployed in a large number of applications. Such applications may include, but are not limited to, surveillance systems or network security appliances. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize a variety of applications into which machine learning models trained in accordance with embodiments discussed herein may be deployed.

Embodiments of the present disclosure include various processes, which will be described below. The processes may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the steps. Alternatively, steps may be performed by a combination of hardware, software, firmware, and/or by human operators.

Various embodiments may be provided as a computer program product, which may include a machine-readable storage medium tangibly embodying thereon instructions, which may be used to program the computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, fixed (hard) drives, magnetic tape, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, semiconductor memories, such as ROMs, PROMs, random access memories (RAMs), programmable read-only memories (PROMs), erasable PROMs (EPROMs), electrically erasable PROMs (EEPROMs), flash memory, magnetic or optical cards, or other types of media/machine-readable medium suitable for storing electronic instructions (e.g., computer programming code, such as software or firmware).

Various methods described herein may be practiced by combining one or more machine-readable storage media containing the code according to the present disclosure with appropriate standard computer hardware to execute the code contained therein. An apparatus for practicing various embodiments of the present disclosure may involve one or more computers (or one or more processors within the single computer) and storage systems containing or having network access to a computer program(s) coded in accordance with various methods described herein, and the method steps of the disclosure could be accomplished by modules, routines, subroutines, or subparts of a computer program product.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of example embodiments. It will be apparent, however, to one skilled in the art that embodiments described herein may be practiced without some of these specific details.

Terminology

Brief definitions of terms used throughout this application are given below.

The terms “connected” or “coupled” and related terms are used in an operational sense and are not necessarily limited to a direct connection or coupling. Thus, for example, two devices may be coupled directly or via one or more intermediary media or devices. As another example, devices may be coupled in such a way that information can be passed there between, while not sharing any physical connection with one another. Based on the disclosure provided herein, one of ordinary skill in the art will appreciate a variety of ways in which connection or coupling exists in accordance with the aforementioned definition.

If the specification states a component or feature “may,” “can,” “could,” or “might” be included or have a characteristic, that particular component or feature is not required to be included or have the characteristic.

As used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.

The phrases “in an embodiment,” “according to one embodiment,” and the like generally mean the particular feature, structure, or characteristic following the phrase is included in at least one embodiment of the present disclosure and may be included in more than one embodiment of the present disclosure. Importantly, such phrases do not necessarily refer to the same embodiment.

As used herein, a “surveillance system” or a “video surveillance system” generally refers to a system including one or more video cameras coupled to a network. The audio and/or video captured by the video cameras may be live monitored and/or transmitted to a central location for recording, storage, and/or analysis. In some embodiments, a network security appliance may perform video analytics on video captured by a surveillance system and may be considered to be part of the surveillance system.

As used herein, a “network security appliance” or a “network security device” generally refers to a device or appliance in virtual or physical form that is operable to perform one or more security functions. Some network security devices may be implemented as general-purpose computers or servers with appropriate software operable to perform one or more security functions. Other network security devices may also include custom hardware (e.g., one or more custom Application-Specific Integrated Circuits (ASICs)). A network security device is typically associated with a particular network (e.g., a private enterprise network) on behalf of which it provides one or more security functions. The network security device may reside within the particular network that it is protecting, or network security may be provided as a service with the network security device residing in the cloud. Non-limiting examples of security functions include authentication, next-generation firewall protection, antivirus scanning, content filtering, data privacy protection, web filtering, network traffic inspection (e.g., secure sockets layer (SSL) or Transport Layer Security (TLS) inspection), intrusion prevention, intrusion detection, denial of service (DoS) attack detection and mitigation, encryption (e.g., Internet Protocol Secure (IPsec), TLS, SSL), application control, Voice over Internet Protocol (VoIP) support, Virtual Private Networking (VPN), data leak prevention (DLP), antispam, antispyware, logging, reputation-based protections, event correlation, network access control, vulnerability management, and the like. Such security functions may be deployed individually as part of a point solution or in various combinations in the form of a unified threat management (UTM) solution. Non-limiting examples of network security appliances/devices include network gateways, VPN appliances/gateways, UTM appliances (e.g., the FORTIGATE family of network security appliances), messaging security appliances (e.g., FORTIMAIL family of messaging security appliances), database security and/or compliance appliances (e.g., FORTIDB database security and compliance appliance), web application firewall appliances (e.g., FORTIWEB family of web application firewall appliances), application acceleration appliances, server load balancing appliances (e.g., FORTIBALANCER family of application delivery controllers), vulnerability management appliances (e.g., FORTISCAN family of vulnerability management appliances), configuration, provisioning, update and/or management appliances (e.g., FORTIMANAGER family of management appliances), logging, analyzing and/or reporting appliances (e.g., FORTIANALYZER family of network security reporting appliances), bypass appliances (e.g., FORTIBRIDGE family of bypass appliances), Domain Name Server (DNS) appliances (e.g., FORTIDNS family of DNS appliances), wireless security appliances (e.g., FORTIWIFI family of wireless security gateways), and DoS attack detection appliances (e.g., the FORTIDDOS family of DoS attack detection and mitigation appliances).

Various embodiments provide methods for labeling a dataset that include: selecting, by a processing resource, an unlabeled data element from a set of unlabeled data elements to yield a selected, unlabeled data element, where non-selected unlabeled data elements in the set of unlabeled data elements are a non-selected set of unlabeled data elements; selecting, by the processing resource, a subset of the non-selected set of unlabeled data elements; merging, by the processing resource, the selected, unlabeled data element with the subset of the non-selected set of unlabeled data elements to yield a merged, unlabeled dataset; forming, by the processing resource, a union of the merged, unlabeled dataset and a labeled dataset to yield a union dataset; and calculating, by the processing resource, an expected performance value of the union dataset.

In some instances of the aforementioned embodiments, the set of unlabeled data elements is a set of unlabeled vectors, and the labeled dataset is a set of labeled vectors. In some such instances, the methods further include generating the set of unlabeled vectors using a combination of at least a first heuristic and a second heuristic. In some cases, the first heuristic is selected as one of: a Shannon's entropy heuristic, a confidence based heuristic, a distance from decision hyperplane heuristic, an orthogonality to labeled points heuristic, an information density heuristic, a perturbation heuristic, an expected gradient length heuristic, and a consensus based heuristic; and the second heuristic is different from the first heuristic and selected as another of: the Shannon's entropy heuristic, the confidence based heuristic, the distance from decision hyperplane heuristic, the orthogonality to labeled points heuristic, the information density heuristic, the perturbation heuristic, the expected gradient length heuristic, and the consensus based heuristic. In one or more instances, the methods further include generating the set of unlabeled vectors using a combination of four or more of the following heuristics: a Shannon's entropy heuristic, a confidence based heuristic, a distance from decision hyperplane heuristic, an orthogonality to labeled points heuristic, an information density heuristic, a perturbation heuristic, an expected gradient length heuristic, and a consensus based heuristic.

In one or more instances of the aforementioned embodiments, the methods further include changing, by the processing resource, an order of unlabeled data elements in the non-selected set of unlabeled data elements prior to selecting the subset of the non-selected set of unlabeled data elements. In some instances of the aforementioned embodiments, selecting the subset of the non-selected set of unlabeled data elements is done using a step size variable indicating an offset into the non-selected set of unlabeled data elements. In various instances of the aforementioned embodiments where the subset of the non-selected set of unlabeled data elements is a first subset of the non-selected set of unlabeled data elements; the merged, unlabeled dataset is a first merged, unlabeled dataset; the union dataset is a first union dataset; and the expected performance value is a first expected performance value; the methods further include: selecting, by the processing resource, a second subset of the non-selected set of unlabeled data elements; merging, by the processing resource, the selected, unlabeled data element with the second subset of the non-selected set of unlabeled data elements to yield a second merged, unlabeled dataset; forming, by the processing resource, a union of the second merged, unlabeled dataset and the labeled dataset to yield a second union dataset; and calculating a second expected performance value of the second union dataset. In some such instances, the methods further include: combining, by the processing resource, at least the first expected performance value with the second expected performance value to yield a composite performance value for the selected, unlabeled data element; and ranking, by the processing resource, the selected, unlabeled data element relative to at least one of the non-selected unlabeled data elements based at least in part on the composite performance value. In some cases, combining at least the first expected performance value with the second expected performance value to yield the composite performance value for the selected, unlabeled data element includes averaging, by the processing resource, at least the first expected performance value with the second expected performance value to yield the composite performance value.
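For purposes of illustration only, the subset selection, merging, union, and averaging steps described above may be sketched in Python as follows. The names `composite_performance`, `rank_unlabeled`, and `toy_score` are hypothetical, and the disclosure does not define how the expected performance value is computed, so `toy_score` is merely a placeholder. The sketch further assumes that the step size serves as both the offset increment and the subset length, and that a larger composite value ranks higher.

```python
from statistics import mean


def composite_performance(selected, others, labeled, expected_performance, step_size):
    """Average the expected performance over successive subsets of `others`."""
    scores = []
    # Assumption: the step size is both the offset increment and the subset length.
    for offset in range(0, len(others), step_size):
        subset = others[offset:offset + step_size]
        merged = [selected] + subset   # merged, unlabeled dataset
        union = merged + labeled       # union with the labeled dataset (as a list)
        scores.append(expected_performance(union))
    if not scores:                     # no other unlabeled elements to pair with
        scores.append(expected_performance([selected] + labeled))
    return mean(scores)


def rank_unlabeled(unlabeled, labeled, expected_performance, step_size=1):
    """Return (indices ordered by composite value, composite values)."""
    composites = []
    for i, selected in enumerate(unlabeled):
        others = unlabeled[:i] + unlabeled[i + 1:]   # non-selected set
        composites.append(
            composite_performance(selected, others, labeled, expected_performance, step_size)
        )
    order = sorted(range(len(unlabeled)), key=lambda i: composites[i], reverse=True)
    return order, composites


def toy_score(dataset):
    """Placeholder for an expected performance value: spread of scalar elements."""
    return max(dataset) - min(dataset)


labeled = [1.0, 2.0]
unlabeled = [1.5, 5.0, 2.5]
order, composites = rank_unlabeled(unlabeled, labeled, toy_score)
```

With this placeholder score, the element 5.0 receives the highest composite value because every union containing it has the widest spread; a practical embodiment would substitute a model-based estimate of expected performance.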

In some instances of the aforementioned embodiments, where the selected, unlabeled data element is a first selected, unlabeled data element; the non-selected set of unlabeled data elements is a first non-selected set of unlabeled data elements; the subset of the non-selected set of unlabeled data elements is a first subset of the non-selected set of unlabeled data elements; the merged, unlabeled dataset is a first merged, unlabeled dataset; the union dataset is a first union dataset; and the expected performance value is a first expected performance value; the methods further include: selecting, by the processing resource, a second unlabeled data element from the set of unlabeled data elements to yield a second selected, unlabeled data element, wherein non-selected unlabeled data elements in the set of unlabeled data elements are a second non-selected set of unlabeled data elements; selecting, by the processing resource, a second subset of the non-selected set of unlabeled data elements; merging, by the processing resource, the second selected, unlabeled data element with the second subset of the non-selected set of unlabeled data elements to yield a second merged, unlabeled dataset; forming, by the processing resource, a second union of the second merged, unlabeled dataset and the labeled dataset to yield a second union dataset; and calculating a second expected performance value of the second union dataset. In some cases, the methods further include using at least the first expected performance value and the second expected performance value to rank the first selected, unlabeled data element relative to the second selected, unlabeled data element.

Other embodiments provide systems for labeling a dataset that include: a processing resource, and a non-transitory computer-readable medium coupled to the processing resource. The non-transitory computer-readable medium has stored therein instructions that when executed by the processing resource cause the processing resource to: select an unlabeled data element from a set of unlabeled data elements to yield a selected, unlabeled data element, wherein non-selected unlabeled data elements in the set of unlabeled data elements are a non-selected set of unlabeled data elements; select a subset of the non-selected set of unlabeled data elements; merge the selected, unlabeled data element with the subset of the non-selected set of unlabeled data elements to yield a merged, unlabeled dataset; form a union of the merged, unlabeled dataset and a labeled dataset to yield a union dataset; and calculate an expected performance value of the union dataset.

Yet other embodiments provide non-transitory computer-readable storage media embodying a set of instructions, which when executed by one or more processing resources of a computer system, causes the one or more processing resources to: select an unlabeled data element from a set of unlabeled data elements to yield a selected, unlabeled data element, where non-selected unlabeled data elements in the set of unlabeled data elements are a non-selected set of unlabeled data elements; select a subset of the non-selected set of unlabeled data elements; merge the selected, unlabeled data element with the subset of the non-selected set of unlabeled data elements to yield a merged, unlabeled dataset; form a union of the merged, unlabeled dataset and a labeled dataset to yield a union dataset; and calculate an expected performance value of the union dataset.

Additional embodiments provide methods for training a mathematical model using spatial emphasis. Such methods include: receiving, by a processing resource, a set of vectors to be ranked; applying, by the processing resource, a mathematical model to the set of vectors to be ranked to yield a set of predicted vectors; using, by the processing resource, a spatial emphasis value, the set of vectors to be ranked, and the set of predicted vectors in a scaling function to enhance a region of interest within a range expected for the set of vectors to yield a tuned scaling function; and training, by the processing resource, the mathematical model on the tuned scaling function.

In some instances of the aforementioned embodiments, the mathematical model is a neural network model. In various instances of the aforementioned embodiments, the scaling function is a function of: the spatial emphasis value; an expected label for each of the set of vectors to be ranked; and a label predicted by the vector ranking model for each of the set of vectors to be ranked. In some cases, the spatial emphasis value is one. In various cases, the scaling function is further a function of a weight decay tuning value. In some such cases, the methods further include determining, by the processing resource, the weight decay tuning value using Tree Parzen Estimation.

In various instances of the aforementioned embodiments, the scaling function includes a combination of only exponent, square, and linear functions. In some instances of the aforementioned embodiments, the scaling function is an exponential loss function. In various instances of the aforementioned embodiments, the scaling function is:

$\frac{1}{N}\sum_{i=1}^{N}\exp\left(-\frac{\left(y_{i} - \mathrm{region}_{\mathrm{interest}}\right)^{2}}{2\tau^{2}}\right)\left(y_{i} - \hat{y}_{i}\right)^{2}$

where region_(interest) is the spatial emphasis value, y_(i) is the label that the vector ranking model should have provided, ŷ_(i) is the label predicted by the vector ranking model, N is the number of vectors in the set of vectors, i is a counter from 1 to N, and τ is a weight decay tuning value.
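As a non-limiting sketch, the scaling function set forth above may be evaluated in Python as follows. The function name `spatial_emphasis_loss` and the example values are hypothetical; the parameters `region_interest` and `tau` correspond to the spatial emphasis value and the weight decay tuning value τ, respectively.

```python
import math


def spatial_emphasis_loss(y_true, y_pred, region_interest, tau):
    """Mean squared error weighted by a Gaussian centered on the region of interest."""
    n = len(y_true)
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        # Weight approaches 1 when the expected label y is near the region of
        # interest and decays toward 0 as y moves away, at a rate set by tau.
        weight = math.exp(-((y - region_interest) ** 2) / (2.0 * tau ** 2))
        total += weight * (y - y_hat) ** 2
    return total / n


# Hypothetical labels: both predictions are off by 0.5, but only the first
# expected label sits at the region of interest (1.0).
loss = spatial_emphasis_loss([1.0, 0.0], [0.5, 0.5], region_interest=1.0, tau=0.5)
```

In this example both predictions carry the same squared error, yet the error for the expected label of 1.0 is weighted by 1 while the error for the expected label of 0.0 is weighted by exp(-2), so training on this function emphasizes accuracy near the region of interest.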

Additional embodiments provide systems for training a mathematical model using spatial emphasis. Such systems include: a processing resource and a non-transitory computer-readable medium coupled to the processing resource. The non-transitory computer-readable medium has stored therein instructions that when executed by the processing resource cause the processing resource to: receive a set of vectors to be ranked; apply a mathematical model to the set of vectors to be ranked to yield a set of predicted vectors; use a spatial emphasis value, the set of vectors to be ranked, and the set of predicted vectors in a scaling function to enhance a region of interest within a range expected for the set of vectors to yield a tuned scaling function; and train the mathematical model on the tuned scaling function.

Further embodiments provide non-transitory computer-readable storage media embodying a set of instructions, which when executed by one or more processing resources of a computer system, causes the one or more processing resources to: receive a set of vectors to be ranked; apply a mathematical model to the set of vectors to be ranked to yield a set of predicted vectors; use a spatial emphasis value, the set of vectors to be ranked, and the set of predicted vectors in a scaling function to enhance a region of interest within a range expected for the set of vectors to yield a tuned scaling function; and train the mathematical model on the tuned scaling function.

Yet further embodiments provide methods for automated handling of data and conceptual drift. Such methods include: receiving, by a processing resource, at least a first decision output and a first confidence value corresponding to the first decision output and a second decision output and a second confidence value corresponding to the second decision output, wherein the first decision output and the second decision output are included in a set of decision outputs from a first mathematical model; selecting, by the processing resource, the first decision output for inclusion in a subset of the set of decision outputs based upon the first confidence value exceeding a confidence threshold value; applying, by the processing resource, a second mathematical model to a dataset including the subset of the set of decision outputs, wherein the second mathematical model provides an updated decision output corresponding to the first decision output; and selecting, by the processing resource, the first decision output for labelling based at least in part on a combination of the first decision output and the updated decision output.

In some instances of the aforementioned embodiments, the dataset including the subset of the set of decision outputs further includes a plurality of previously labelled decision outputs. In some such instances, the methods further include: labelling, by the processing resource, the first decision output to yield a newly labelled decision output; and adding, by the processing resource, the newly labelled decision output to the plurality of previously labelled decision outputs.

In some instances of the aforementioned embodiments, the methods further include comparing, by the processing resource, the first decision output with one of the previously labelled decision outputs to yield a comparison result. In such instances, selecting the first decision output for labelling is done based at least in part on the combination of the first decision output and the updated decision output, and upon the comparison result. In some such instances, the comparison result indicates that the one of the previously labelled decision outputs is similar to the first decision output.

In various instances of the aforementioned embodiments, the methods further include excluding, by the processing resource, the second decision output from inclusion in the subset of the set of decision outputs based upon the second confidence value being less than the confidence threshold value. In one or more instances of the aforementioned embodiments, selecting the first decision output for labelling based at least in part on the combination of the first decision output and the updated decision output includes selecting the first decision output for labelling based at least in part on the first decision output matching the updated decision output.
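By way of example only, the confidence filtering and decision matching described above may be sketched in Python as follows. The function `select_for_labelling`, the tuple layout of the decision outputs, and the stand-in `second_model` are all hypothetical; an actual second mathematical model could be, for instance, a neural network, and the sketch omits the optional comparison against previously labelled decision outputs.

```python
def select_for_labelling(outputs, second_model, confidence_threshold):
    """Pick decision outputs that are confident and confirmed by a second model.

    `outputs` is a list of (features, decision, confidence) tuples from the
    first model; this layout is an assumption made for the sketch.
    """
    # Keep only outputs whose confidence exceeds the threshold; the rest are
    # excluded from the subset passed to the second model.
    confident = [o for o in outputs if o[2] > confidence_threshold]

    selected = []
    for features, decision, _confidence in confident:
        updated = second_model(features)   # updated decision output
        if updated == decision:            # first decision matches the update
            selected.append((features, decision))
    return selected


def second_model(features):
    """Hypothetical stand-in for the second mathematical model."""
    return "malicious" if features[0] > 0.5 else "benign"


outputs = [
    ((0.9,), "malicious", 0.95),   # confident and confirmed
    ((0.2,), "benign", 0.55),      # below the confidence threshold
    ((0.1,), "malicious", 0.92),   # confident but contradicted
]
selected = select_for_labelling(outputs, second_model, confidence_threshold=0.8)
```

Here only the first decision output is selected for labelling: the second is excluded for low confidence, and the third is rejected because the updated decision output does not match it.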

In some instances of the aforementioned embodiments, the second mathematical model is a neural network model. In various instances of the aforementioned embodiments, the methods further include automatically updating, by the processing resource, a pre-trained model to protect against temporal shifts in data, wherein the longevity of the pre-trained model is increased.

Other embodiments provide systems for automated handling of data and conceptual drift that include a processing resource, and a non-transitory computer-readable medium coupled to the processing resource. The non-transitory computer-readable medium has stored therein instructions that when executed by the processing resource cause the processing resource to: receive at least a first decision output and a first confidence value corresponding to the first decision output and a second decision output and a second confidence value corresponding to the second decision output, wherein the first decision output and the second decision output are included in a set of decision outputs from a first mathematical model; select the first decision output for inclusion in a subset of the set of decision outputs based upon the first confidence value exceeding a confidence threshold value; apply a second mathematical model to a dataset including the subset of the set of decision outputs, wherein the second mathematical model provides an updated decision output corresponding to the first decision output; and select the first decision output for labelling based at least in part on a combination of the first decision output and the updated decision output. In various instances of the aforementioned embodiments, the instructions, when executed by the processing resource, further cause the processing resource to automatically update a pre-trained model to protect against temporal shifts in data, wherein the longevity of the pre-trained model is increased.

Further embodiments provide non-transitory computer-readable storage media embodying a set of instructions, which when executed by one or more processing resources of a computer system, causes the one or more processing resources to: receive at least a first decision output and a first confidence value corresponding to the first decision output and a second decision output and a second confidence value corresponding to the second decision output, wherein the first decision output and the second decision output are included in a set of decision outputs from a first mathematical model; select the first decision output for inclusion in a subset of the set of decision outputs based upon the first confidence value exceeding a confidence threshold value; apply a second mathematical model to a dataset including the subset of the set of decision outputs, wherein the second mathematical model provides an updated decision output corresponding to the first decision output; and select the first decision output for labelling based at least in part on a combination of the first decision output and the updated decision output.

Yet further embodiments provide methods for identifying a high value labeling target that include: receiving, by a processing resource, a first set of data elements including at least a first data element and a second data element; applying, by the processing resource, a mathematical model to the first set of data elements to yield at least: a first predicted output corresponding to the first data element, and a second predicted output corresponding to the second data element; adding, by the processing resource, a perturbation to the first data element to yield a perturbed data element; applying, by the processing resource, the mathematical model to a second set of data elements including the perturbed data element to yield at least a third predicted output corresponding to the perturbed data element; and using, by the processing resource, a combination of the first predicted output and the third predicted output to determine a labeling value of the first data element. In some instances of the aforementioned embodiments, the first data element is a first vector, the second data element is a second vector, and the first set of data elements is a set of vectors.

In various instances of the aforementioned embodiments, using the combination of the first predicted output and the third predicted output to determine the labeling value of the first data element includes: calculating, by the processing resource, divergence of the first predicted output to yield a first divergence; calculating, by the processing resource, divergence of the third predicted output to yield a second divergence; and using, by the processing resource, a combination of the first divergence and the second divergence to determine a labeling value of the first data element. In some cases, both the first divergence and the second divergence are calculated using a Kullback-Leibler algorithm in accordance with the following equation:

$D_{KL}( p(y|x) \parallel p(y|x + \epsilon) )$

In various cases, using the combination of the first divergence and the second divergence to determine a labeling value of the first data element includes: calculating, by the processing resource, a difference between the first divergence and the second divergence to yield a divergence difference; and comparing, by the processing resource, the divergence difference to a threshold value, where upon determining that the divergence difference exceeds the threshold value, the first data element is identified as a high value labeling target.
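By way of non-limiting illustration, the perturbation comparison may be sketched as follows. The sketch assumes that the mathematical model returns a discrete class-probability distribution for a data element and for its perturbed counterpart, and it uses the single Kullback-Leibler term of the equation above as the score compared against the threshold. The function names, the smoothing constant, and the example threshold are hypothetical and are not taken from the disclosure.

```python
import math

def kl_divergence(p, q, smoothing=1e-12):
    """D_KL(p || q) for two discrete distributions over the same classes.

    `smoothing` is an assumed constant that guards against log(0).
    """
    return sum(
        pi * math.log((pi + smoothing) / (qi + smoothing))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def is_high_value_target(p_original, p_perturbed, threshold):
    """Flag a data element whose prediction shifts strongly under perturbation."""
    return kl_divergence(p_original, p_perturbed) > threshold

# Hypothetical model outputs p(y|x) and p(y|x + epsilon).
p_x = [0.70, 0.20, 0.10]
p_x_unstable = [0.30, 0.60, 0.10]   # prediction moved noticeably
p_x_stable = [0.69, 0.21, 0.10]     # prediction barely moved

print(is_high_value_target(p_x, p_x_unstable, threshold=0.1))
print(is_high_value_target(p_x, p_x_stable, threshold=0.1))
```

Under these assumptions, a data element whose predicted distribution is sensitive to a small perturbation receives a larger score and is therefore more likely to be identified as a high value labeling target.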

In some instances of the aforementioned embodiments where the perturbed data element is a first perturbed data element, the methods further include: calculating, by the processing resource, divergence of the second predicted output to yield a third divergence; adding, by the processing resource, the perturbation to the second data element to yield a second perturbed data element, wherein the second set of data elements includes the second perturbed data element, and wherein applying the mathematical model to the second set of data elements further yields: a fourth predicted output corresponding to the second perturbed data element; calculating, by the processing resource, divergence of the fourth predicted output to yield a fourth divergence; and using, by the processing resource, a combination of the third divergence and the fourth divergence to determine a labeling value of the second data element.

In various instances of the aforementioned embodiments where the first predicted output is a first class, and the third predicted output is a second class, using the combination of the first predicted output and the third predicted output to determine the labeling value of the first data element includes: identifying, by the processing resource, the first data element as a high value labeling target where the first class is different from the second class. In other instances of the aforementioned embodiments where the first predicted output is a first class, and the third predicted output is a second class, using the combination of the first predicted output and the third predicted output to determine the labeling value of the first data element includes: identifying, by the processing resource, the first data element as a low value labeling target where the first class is the same as the second class.
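As a minimal sketch of the class-based comparison, and assuming that each predicted output is available as a list of class probabilities from which the predicted class is taken as the most probable entry, the labeling value may be determined as shown below. The function names and the "high"/"low" return values are illustrative assumptions only.

```python
def predicted_class(probabilities):
    """Index of the most probable class (an assumed way to obtain the class)."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)

def labeling_value_from_classes(probs_original, probs_perturbed):
    """Return "high" when the perturbation changes the predicted class, else "low"."""
    first_class = predicted_class(probs_original)
    second_class = predicted_class(probs_perturbed)
    return "high" if first_class != second_class else "low"

# Hypothetical outputs for a data element and its perturbed counterpart.
print(labeling_value_from_classes([0.7, 0.2, 0.1], [0.3, 0.6, 0.1]))
print(labeling_value_from_classes([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```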

In some instances of the aforementioned embodiments, the methods further include using, by the processing resource, the labeling value of the first vector along with the result of at least one other heuristic to rank the first vector relative to the second vector. In some such instances the at least one other heuristic is one of: a Shannon's entropy heuristic, a confidence based heuristic, an orthogonality to labeled points heuristic, a distance from decision hyperplane heuristic, an information density heuristic, an expected gradient length heuristic, or a consensus based heuristic.

Additional embodiments provide systems for identifying a high value labeling target that include: a processing resource and a non-transitory computer-readable medium coupled to the processing resource. The non-transitory computer readable medium has stored therein instructions that when executed by the processing resource cause the processing resource to: receive a first set of data elements including at least a first data element and a second data element; apply a mathematical model to the first set of data elements to yield at least: a first predicted output corresponding to the first data element, and a second predicted output corresponding to the second data element; add a perturbation to the first data element to yield a perturbed data element; apply the mathematical model to a second set of data elements including the perturbed data element to yield at least a third predicted output corresponding to the perturbed data element; and use a combination of the first predicted output and the third predicted output to determine a labeling value of the first data element.

Yet further embodiments provide methods for identifying an ignored labeling target. Such methods include: receiving, by a processing resource, a set of vectors including at least an unlabeled vector, a first labeled vector, and a second labeled vector; calculating, by the processing resource, a first angle between the unlabeled vector and the first labeled vector, and a second angle between the unlabeled vector and the second labeled vector; and using, by the processing resource, a combination of the first angle and the second angle to determine a labeling value of the unlabeled vector.

In some instances of the aforementioned embodiments, using the combination of the first angle and the second angle to determine a labeling value of the unlabeled vector includes: determining, by the processing resource, that the first angle is less than the second angle; and identifying, by the processing resource, the first angle as a minimum angle based at least in part on determining that the first angle is less than the second angle. In some such instances, using the combination of the first angle and the second angle to determine a labeling value of the unlabeled vector further includes comparing, by the processing resource, the minimum angle with a threshold value. In various cases, using the combination of the first angle and the second angle to determine a labeling value of the unlabeled vector further includes identifying, by the processing resource, the unlabeled vector as a high value labeling target where the minimum angle exceeds the threshold value. In some cases, the threshold value is user programmable.
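The minimum-angle comparison described above may be illustrated with the following minimal sketch. It assumes non-zero, real-valued vectors of equal length and measures angles in degrees from the cosine of the angle between two vectors. The function names and the example threshold of sixty (60) degrees are hypothetical choices made for illustration.

```python
import math

def angle_between(u, v):
    """Angle in degrees between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))

def min_angle_to_labeled(unlabeled, labeled_vectors):
    """Smallest angle between the unlabeled vector and any labeled vector."""
    return min(angle_between(unlabeled, lv) for lv in labeled_vectors)

def is_ignored_target(unlabeled, labeled_vectors, threshold_degrees):
    """High value labeling target when the minimum angle exceeds the threshold."""
    return min_angle_to_labeled(unlabeled, labeled_vectors) > threshold_degrees

labeled = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
near_labeled = [0.9, 0.1, 0.0]   # small angle to the first labeled vector
orthogonal = [0.0, 0.0, 1.0]     # orthogonal to every labeled vector

print(is_ignored_target(near_labeled, labeled, threshold_degrees=60.0))
print(is_ignored_target(orthogonal, labeled, threshold_degrees=60.0))
```

In this sketch, a vector that is orthogonal to the entire labeled set has a minimum angle of ninety (90) degrees and is therefore flagged, whereas a vector lying close to an existing labeled vector is not.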

In various instances of the aforementioned embodiments, the methods further include using, by the processing resource, the labeling value of the unlabeled vector along with the result of at least one other heuristic to rank the unlabeled vector relative to other unlabeled vectors. In some such instances, the at least one other heuristic is one of: a Shannon's entropy heuristic, a confidence based heuristic, a distance from decision hyperplane heuristic, an information density heuristic, a perturbation heuristic, an expected gradient length heuristic, or a consensus based heuristic.

Additional embodiments provide systems for identifying an ignored labeling target that include a processing resource, and a non-transitory computer-readable medium coupled to the processing resource. The non-transitory computer readable medium has stored therein instructions that when executed by the processing resource cause the processing resource to: receive a set of vectors including at least an unlabeled vector, a first labeled vector, and a second labeled vector; calculate a first angle between the unlabeled vector and the first labeled vector, and a second angle between the unlabeled vector and the second labeled vector; and use a combination of the first angle and the second angle to determine a labeling value of the unlabeled vector.

Yet additional embodiments provide non-transitory computer-readable storage media embodying a set of instructions, which when executed by one or more processing resources of a computer system, cause the one or more processing resources to: receive a set of vectors including at least an unlabeled vector, a first labeled vector, and a second labeled vector; calculate a first angle between the unlabeled vector and the first labeled vector, and a second angle between the unlabeled vector and the second labeled vector; and use a combination of the first angle and the second angle to determine a labeling value of the unlabeled vector.

Some embodiments provide methods for modeling data that include: ranking, by a processing resource, a set of unlabeled data elements based upon an expected impact of each of the unlabeled data elements on operation of a first mathematical model to yield a subset of high ranked data elements, where the subset of high ranked data elements includes at least one unlabeled data element having a ranking that is higher than another data element of the set of data elements not included in the subset of high ranked data elements; training, by the processing resource, the first mathematical model using a first dataset including both the subset of high ranked data elements and a set of previously labelled data elements to yield at least a first decision output and a first confidence value corresponding to the first decision output and a second decision output and a second confidence value corresponding to the second decision output; applying, by the processing resource, a second mathematical model to a second dataset including at least the first decision output, wherein the second mathematical model provides an updated decision output corresponding to the first decision output; and selecting, by the processing resource, the first decision output for labelling based at least in part on a combination of the first decision output and the updated decision output.

In some instances of the aforementioned embodiments, the methods further include: selecting, by the processing resource, the first decision output for inclusion in the second dataset based upon the first confidence value exceeding a confidence threshold value; and excluding, by the processing resource, the second decision output from inclusion in the second dataset based upon the second confidence value being less than the confidence threshold value. In various instances of the aforementioned embodiments, the methods further include comparing, by the processing resource, the first decision output with one of the previously labelled decision outputs to yield a comparison result. In such instances, selecting the first decision output for labelling is done based at least in part on the combination of the first decision output and the updated decision output, and upon the comparison result. In some cases, the comparison result indicates that the one of the previously labelled decision outputs is similar to the first decision output.

In various instances of the aforementioned embodiments, selecting the first decision output for labelling based at least in part on the combination of the first decision output and the updated decision output includes selecting, by the processing resource, the first decision output for labelling based at least in part on the first decision output matching the updated decision output. In some instances of the aforementioned embodiments, ranking the set of unlabeled data elements based upon the expected impact of each of the unlabeled data elements on operation of the first mathematical model includes: selecting, by the processing resource, an unlabeled data element from a set of unlabeled data elements to yield a selected, unlabeled data element, wherein non-selected unlabeled data elements in the set of unlabeled data elements are a non-selected set of unlabeled data elements; selecting, by the processing resource, a subset of the non-selected set of unlabeled data elements; merging, by the processing device, the selected, unlabeled data element with the subset of the non-selected set of unlabeled data elements to yield a merged, unlabeled dataset; forming, by the processing device, a union of the merged, unlabeled dataset and a labelled dataset to yield a union dataset; and calculating, by the processing resource, an expected performance value of the union dataset. In some such instances where the set of unlabeled data elements is a set of unlabeled vectors, and the labelled dataset is a set of labelled vectors, the methods further include: generating the set of unlabeled vectors using a combination of at least a first heuristic and a second heuristic.

Turning to FIG. 1A, an example system 100 including a machine learning model training system 110 is shown in accordance with some embodiments. Machine learning model training system 110 includes a seed vector identification module 132, a multiple process feature calculation module 134, a vector ranking module 136, a model selecting module 138, a model training module 140, and a labeling module 142.

Seed vector identification module 132 is configured to identify initial vectors for labeling. The process of identifying the seed vectors is provided below in relation to FIGS. 3A-3E. The resulting identified seed vectors are representative of a particular class for which they are labeled to yield a set of labeled vectors. Some embodiments discussed herein apply pre-clustering techniques to determine an initial set of labeled vectors.

Multiple process feature calculation module 134 is configured to determine multiple heuristics that are in turn provided to a ranking model and used in relation to vector ranking. In particular, a number of heuristics are calculated for each decision output vector provided from the model to be trained. Such heuristics may include, but are not limited to, a Shannon's entropy heuristic, a confidence based heuristic, a distance from decision hyperplane heuristic, an orthogonality to labeled points heuristic, an information density heuristic, a perturbation heuristic, an expected gradient length heuristic, and/or a consensus based heuristic. These heuristics are computed for each unlabeled vector using information gleaned from the labeled vectors and the problem space as a whole. To ensure that the analysis detects insightful vectors that are critical for labeling, rather than vectors that the model to be trained simply fails to classify properly, multiple models are trained in addition to the model to be trained to aid in the heuristic compilation process. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize other heuristics that may be used in relation to different embodiments.

Vector ranking module 136 is configured to rank vectors based at least in part on the aforementioned set of heuristics. In some embodiments, vector ranking module 136 is a Vector Querying Neural Network (VQNN) where the heuristics are used to rank the unlabeled vectors as to which is most important to proper operation of the model to be trained. In effect, by feeding the heuristics to such a neural network, each strategy represented by the respective heuristics is represented in the process of determining a desired vector to label next. Use of such a combination of heuristics avoids common pitfalls that plague single heuristic strategies, such as selecting outliers to label rather than vectors that can greatly improve model accuracy.

An exponential loss function as shown in the following equation is applied to the ranked vectors to enhance the fineness of the ranking of vectors around the previously identified region of interest (i.e., y_(i) values falling in the region of interest):

$\frac{1}{N}\sum\limits_{i}\exp\left( -\frac{( y_{i} - {region}_{interest} )^{2}}{2\tau^{2}} \right)( y_{i} - \hat{y}_{i} )^{2}$

where y_(i) is the ranking that should have happened, ŷ_(i) (hereinafter also denoted yihat) is the ranking predicted by the neural network model, N is the number of vectors considered, and τ is a hyperparameter that controls how quickly weight falloff occurs. One of ordinary skill in the art will appreciate that a suitable value of τ can be determined in a variety of different ways including, but not limited to, an automated approach using a Tree-structured Parzen Estimator. In some embodiments, the region_(interest) is hand selected by one knowledgeable about the problem set. In other embodiments, an estimated optimum value of the region_(interest) can be identified by hyperparameter tuning in a similar manner as τ. In some cases of vector ranking discussed herein, the VQNN may be trained using a region_(interest) equal to 1.
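For purposes of illustration only, the loss function above may be evaluated as in the following sketch, which assumes that the rankings y_(i) and ŷ_(i) are supplied as plain lists of numbers. The function name region_weighted_loss and the example values of τ are hypothetical.

```python
import math

def region_weighted_loss(y_true, y_pred, region_interest, tau):
    """Mean squared ranking error, weighted by closeness of y_i to the region of interest."""
    n = len(y_true)
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        # Gaussian-shaped weight: 1 at the region of interest, falling off at a rate set by tau.
        weight = math.exp(-((y - region_interest) ** 2) / (2 * tau ** 2))
        total += weight * (y - y_hat) ** 2
    return total / n

# The same prediction error of 0.5 is penalized differently depending on
# how far the true ranking lies from a region of interest equal to 1.
near = region_weighted_loss([1.0], [0.5], region_interest=1.0, tau=0.5)
far = region_weighted_loss([0.0], [0.5], region_interest=1.0, tau=0.5)
print(near, far)
```

As the sketch suggests, an error made on a vector whose true ranking lies at the region of interest contributes its full squared error, while the same error far from the region of interest is discounted, with a smaller τ producing a sharper falloff.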

Model to be trained training module 138 is configured to accept a number of unlabeled and labeled vectors that are used to train a model to be trained, and to determine whether the quality of the output of the model to be trained is sufficient. In some embodiments, the output of the model to be trained includes a series of decision output vectors and corresponding confidence outputs that each indicate a level of confidence for a respective one of the series of decision output vectors. Any approach and/or thresholds known in the art for determining model accuracy may be used. For example, in some embodiments, the model to be trained is considered sufficiently accurate where more than ninety-five (95) percent of the decision output vectors match the label applied to the corresponding input vectors. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize a variety of thresholds and/or approaches for determining that the model to be trained is sufficiently accurate.

Oracle input module 140 is configured to receive input indicating a status of a vector that has been selected for labeling. The selected, unlabeled vectors are selected based upon which have the highest rank. By ranking vectors based upon their expected value to the model to be trained and providing only the highest ranked to the oracle for labeling, time and effort of the oracle to perform the labeling process is dramatically decreased.

Labeling module 142 is configured to perform automated, adaptive labeling of vectors that exhibit a high degree of confidence indicated, for example, by exceeding a programmable user threshold of confidence. Labeling module 142 processes such high confidence vectors through a mathematical model that validates the decision output. Finally, labeling module 142 compares any vectors where the decision output was validated to previously labeled vectors having the same label as indicated by the decision output vector. Where a previously labeled vector is found that is similar to the unlabeled vector under consideration and the labels for both would be the same, labeling module 142 labels the unlabeled vector with the label indicated by the decision output vector and adds it to the labeled vector set.

Turning to FIG. 1B, an example computer system 160 in which or with which embodiments of the present disclosure may be utilized is shown. As shown in FIG. 1B, computer system 160 includes an external storage device 170, a bus 172, a main memory 174, a read-only memory 176, a mass storage device 178, one or more communication ports 180, and one or more processing resources (e.g., processing circuitry 182). In one embodiment, computer system 160 may be used to perform the functions discussed herein in relation to FIGS. 1A and 2-6. Those skilled in the art will appreciate that computer system 160 may include more than one processing resource and communication port 180. Non-limiting examples of processing circuitry 182 include: Intel Quad-Core, Intel i3, Intel i5, Intel i7, Apple M1, AMD Ryzen, or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, FortiSOC™ system on chip processors or other future processors. Processing circuitry 182 may include various modules associated with embodiments of the present disclosure.

Communication port 180 can be any of an RS-232 port for use with a modem-based dialup connection, a 10/100 Ethernet port, a Gigabit, 10 Gigabit, 25 G, 40 G, or 100 G port using copper or fiber, a serial port, a parallel port, or other existing or future ports. Communication port 180 may be chosen depending on a network, such as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which the computer system connects.

Memory 174 can be Random Access Memory (RAM), or any other dynamic storage device commonly known in the art. Read only memory 176 can be any static storage device(s), e.g., but not limited to, Programmable Read Only Memory (PROM) chips for storing static information, e.g., start-up or BIOS instructions for the processing resource.

Mass storage device 178 may be any current or future mass storage solution, which can be used to store information and/or instructions. Non-limiting examples of mass storage solutions include Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), e.g., those available from Seagate (e.g., the Seagate Barracuda 7200 family) or Hitachi (e.g., the Hitachi Deskstar 7K144), one or more optical discs, Redundant Array of Independent Disks (RAID) storage, e.g., an array of disks (e.g., SATA arrays), available from various vendors including Dot Hill Systems Corp., LaCie, Nexsan Technologies, Inc. and Enhance Technology, Inc.

Bus 172 communicatively couples processing resource(s) with the other memory, storage and communication blocks. Bus 172 can be, e.g., a Peripheral Component Interconnect (PCI)/PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like, for connecting expansion cards, drives and other subsystems as well as other buses, such as a front side bus (FSB), which connects processing resources to the software system.

Optionally, operator and administrative interfaces, e.g., a display, keyboard, and a cursor control device, may also be coupled to bus 172 to support direct operator interaction with the computer system. Other operator and administrative interfaces can be provided through network connections connected through communication port 180. External storage device 170 can be any kind of external hard drive, floppy drive, IOMEGA® Zip Drive, Compact Disc Read Only Memory (CD-ROM), Compact Disc Re-Writable (CD-RW), or Digital Video Disk Read Only Memory (DVD-ROM). Components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system limit the scope of the present disclosure.

While embodiments of the present disclosure have been illustrated and described, numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art. Thus, it will be appreciated by those of ordinary skill in the art that the diagrams, schematics, illustrations, and the like represent conceptual views or processes illustrating systems and methods embodying various non-limiting examples of embodiments of the present disclosure. The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing associated software. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the entity implementing the particular embodiment. Those of ordinary skill in the art further understand that the exemplary hardware, software, processes, methods, and/or operating systems described herein are for illustrative purposes and, thus, are not intended to be limited to any particular named. While the foregoing describes various embodiments of the disclosure, other and further embodiments may be devised without departing from the basic scope thereof.

Turning to FIG. 2, a flow diagram 200 shows a method in accordance with various embodiments for model development. Following flow diagram 200, a problem space is selected for modeling (block 202). Such a problem space may be any problem space where data is available for training a model to be trained. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize a large variety of problem spaces to which embodiments discussed herein may be applied. Data relevant to the problem space is obtained (block 203). Any approach known in the art may be used for obtaining data for a problem space. For example, where the problem space is identifying malicious emails, large numbers of emails may be collected into a database to be used for training the model to be trained.

The type of model to be trained is selected (block 204). As is known in the art, some model types are more useful for certain types of problem spaces than other models. Such model types may include, but are not limited to, various classes of neural network models or linear regression models. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize a variety of model types that may be selected as a model to be trained in accordance with different embodiments.

Along with identifying a type of model to be trained (block 204), a user can also select a region of interest for the problem space (block 206). The region of interest is a region of the dataset on which, for some reason, the user wants to place particular focus. As one of many examples, assume the problem space involves identifying malicious emails. In this problem space, clearly malicious emails may be given a value of one thousand (1000) and clearly benign emails may be given a value of zero (0), with all values in between representing a likelihood that a particular email is malicious. For operational purposes, all emails with a value greater than five hundred (500) are considered malicious and all other emails are considered benign. As emails with values greater than six hundred (600) exhibit a significant degree of confidence that the email is malicious and all emails with a value less than four hundred (400) exhibit a significant degree of confidence that the email is benign, careful classification of such emails is not necessary because a small error still likely results in proper classification. However, for emails with values in the range of four hundred (400) to six hundred (600), an error made as part of the classification process could incorrectly label a benign email as malicious or a malicious email as benign. Thus, in this case, the region of interest would be from four hundred (400) to six hundred (600), where a heightened degree of consideration is desired. This region of interest comes into play when considering the ranking of vectors to be presented to an oracle for labeling as more fully discussed below.
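The email example above may be expressed as the following minimal sketch. The constants mirror the values given in the example, while the function and constant names are hypothetical and chosen only for illustration.

```python
MALICIOUS_CUTOFF = 500            # values above this are treated as malicious
REGION_OF_INTEREST = (400, 600)   # range where a small error can flip the outcome

def classify(score):
    """Operational decision for an email score between 0 and 1000."""
    return "malicious" if score > MALICIOUS_CUTOFF else "benign"

def in_region_of_interest(score, region=REGION_OF_INTEREST):
    """True when the score falls where heightened consideration is desired."""
    low, high = region
    return low <= score <= high

for score in (50, 450, 550, 900):
    print(score, classify(score), in_region_of_interest(score))
```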

Seed vectors within the obtained data are identified and labeled (block 208). This block is shown in dashed lines as a more detailed discussion of one embodiment of seed vector identification is provided below in relation to FIGS. 3A-3E, and the identified seed vectors are representative of a particular class for which they are labeled to yield a set of labeled vectors. Some embodiments discussed herein apply pre-clustering techniques to determine an initial set of labeled vectors. Such pre-clustering is applied to high dimensional, multi-class problem spaces as shown in FIGS. 3A-3E.

The resulting set of labeled vectors along with other unlabeled vectors from the problem space are used to train the model to be trained (block 210). As is known in the art, training a mathematical model includes providing real life data, some of which has been labeled, and adaptively changing the model until resulting outputs provided from the model reflect the labeled data. In embodiments herein, such model training is used not only to train the model to be trained, but also to identify portions of the data in the problem space that would be highly valuable to the model operation if they were properly labeled.

To the end of identifying portions of the data in the problem space that would be highly valuable to the model operation if they were properly labeled, the outputs from the model to be trained (i.e., a series of decision output vectors and corresponding confidence outputs indicating a level of confidence for each of the series of decision output vectors) are used to: perform multiple process feature calculation, rank the vectors, and select a subset of the highest ranked vectors (block 212). This block is shown in dashed lines as a more detailed discussion of one embodiment of multiple process feature calculation and vector ranking is provided below in relation to FIGS. 4-6. In the process, a number of heuristics are calculated for each decision output vector provided from the model to be trained. Such heuristics may include, but are not limited to, a Shannon's entropy heuristic, a confidence based heuristic, a distance from decision hyperplane heuristic, an orthogonality to labeled points heuristic, an information density heuristic, a perturbation heuristic, an expected gradient length heuristic, and/or a consensus based heuristic. These heuristics are computed for each unlabeled vector using information gleaned from the labeled vectors and the problem space as a whole. To ensure that the analysis detects insightful vectors that are critical for labeling, rather than vectors that the model to be trained simply fails to classify properly, multiple models are trained in addition to the model to be trained to aid in the heuristic compilation process. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize other heuristics that may be used in relation to different embodiments.
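As a non-limiting sketch of how two of the listed heuristics might be computed for a single decision output, the following assumes that the decision output is available as a list of class probabilities. It takes the confidence based heuristic to be the largest class probability, which is an assumption rather than a definition given in this disclosure, and the function names and feature dictionary layout are hypothetical. The remaining heuristics and the ranking network itself are not shown.

```python
import math

def shannon_entropy(probabilities):
    """Shannon's entropy in bits; larger values indicate a less certain prediction."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def confidence(probabilities):
    """Assumed confidence measure: probability of the most likely class."""
    return max(probabilities)

def heuristic_features(probabilities):
    """Collect per-vector heuristic values into one feature record."""
    return {
        "entropy": shannon_entropy(probabilities),
        "confidence": confidence(probabilities),
    }

print(heuristic_features([0.5, 0.5]))
print(heuristic_features([0.9, 0.1]))
```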

The resulting set of heuristics is provided as a feature set that is fed into a VQNN where the heuristics are used to rank the unlabeled vectors as to which is most important to proper operation of the model to be trained. In effect, by feeding the heuristics to such a neural network, each strategy represented by the respective heuristics is represented in the process of determining a desired vector to label next. Use of such a combination of heuristics avoids common pitfalls that plague single heuristic strategies, such as selecting outliers to label rather than vectors that can greatly improve model accuracy.

An exponential loss function as shown in the following equation is applied to the ranked vectors to enhance the fineness of the ranking of vectors around the previously identified region of interest (i.e., y_(i) values falling in the region of interest):

$\frac{1}{N}\sum\limits_{i}\exp\left( -\frac{( y_{i} - {region}_{interest} )^{2}}{2\tau^{2}} \right)( y_{i} - \hat{y}_{i} )^{2}$

where y_(i) is the ranking that should have happened, ŷ_(i) (hereinafter also denoted yihat) is the ranking predicted by the neural network model, N is the number of vectors considered, and τ is a hyperparameter that controls how quickly weight falloff occurs. One of ordinary skill in the art will appreciate that a suitable value of τ can be determined in a variety of different ways including, but not limited to, an automated approach using a Tree-structured Parzen Estimator. In some embodiments, the region_(interest) is hand selected by one knowledgeable about the problem set. In other embodiments, an estimated optimum value of the region_(interest) can be identified by hyperparameter tuning in a similar manner as τ. In some cases of vector ranking discussed herein, the VQNN may be trained using a region_(interest) equal to 1.

A small percentage of the unlabeled vectors are selected to be passed to an oracle for labeling (block 214). The selected, unlabeled vectors are selected based upon which have the highest rank. In turn, the oracle applies labels to these previously unlabeled vectors and incorporates the labels into the labeled vector set. In some cases, the oracle is a human with knowledge of the problem space. However, in other cases, the oracle may be another non-human source of information about the problem space. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize a variety of oracles that may be used in relation to different embodiments. By ranking vectors based upon their expected value to the model to be trained and providing only the highest ranked to the oracle for labeling, time and effort of the oracle to perform the labeling process is dramatically decreased.
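The selection of block 214 may be sketched as follows, assuming that each unlabeled vector has an identifier and a numeric rank score in which a larger value means a higher rank. The function name, the dictionary layout, and the example fraction are hypothetical.

```python
import math

def select_for_oracle(rank_scores, fraction):
    """Return identifiers of the highest ranked vectors, limited to a fraction of the pool."""
    # Always pass at least one vector so the oracle has something to label.
    count = max(1, math.ceil(len(rank_scores) * fraction))
    ordered = sorted(rank_scores, key=rank_scores.get, reverse=True)
    return ordered[:count]

# Hypothetical rank scores produced by a ranking model.
scores = {"v1": 0.10, "v2": 0.90, "v3": 0.50, "v4": 0.70}
print(select_for_oracle(scores, fraction=0.5))
```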

The augmented set of labeled vectors along with other unlabeled vectors from the problem space are used to again train the model to be trained (block 216). Again, the output of the model to be trained includes a series of decision output vectors and corresponding confidence outputs that each indicate a level of confidence for a respective one of the series of decision output vectors. It is determined whether the model to be trained has achieved sufficient accuracy such that it can be deployed to handle wild unlabeled data (block 218). Any approach and/or thresholds known in the art for determining model accuracy may be used. For example, in some embodiments, the model to be trained is considered sufficiently accurate where more than ninety-five (95) percent of the decision output vectors match the label applied to the corresponding input vectors. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize a variety of thresholds and/or approaches for determining that the model to be trained is sufficiently accurate. Where the model to be trained is sufficiently accurate (block 218), the training process ends and the model is deployed (block 222).

Alternatively, where the model to be trained is not sufficiently accurate (block 218), the series of decision output vectors and corresponding confidence outputs from the model to be trained are used to perform automated, adaptive labeling (block 220). This block is shown in dashed lines as a more detailed discussion of one embodiment of automated, adaptive labeling is provided below in relation to FIG. 8. Such automated, adaptive labeling applies labels to only those vectors exhibiting the highest degree of confidence. Thus, only decision output vectors from the model to be trained that have a corresponding confidence value that exceeds a programmable user threshold are considered for labeling. Next, the high confidence vectors are processed through a mathematical model that validates the decision output. Finally, the vectors where the decision output was validated are compared to previously labeled vectors having the same label as indicated by the decision output vector. Where a previously labeled vector is found that is similar to the unlabeled vector under consideration and the labels for both would be the same, the unlabeled vector is labeled with the label indicated by the decision output vector and it is added to the labeled vector set. The model to be trained is re-trained using the newly augmented labeled vector set and the process of automated, adaptive labeling is repeated until no decision output vectors exhibit a confidence value that exceeds the programmable user threshold. Once no decision output vectors exhibit a confidence value that exceeds the programmable user threshold, the process returns to block 212.

Seed Vector Identification

Turning to FIGS. 3A-3E, an automated process for identifying seed vectors is graphically depicted. In order to start querying vectors, we need an initial set of data with which we can train a preliminary model. Obtaining such data was discussed above in relation to block 204 of FIG. 2, and the automated process for identifying seed vectors discussed in relation to FIGS. 3A-3E may be used in some embodiments in place of block 208 discussed above in relation to FIG. 2.

While traditional active learning algorithms create a set of seed vectors using randomly sampled data, some embodiments discussed herein utilize pre-clustering sampling techniques to determine an initial set of labeled vectors (i.e., seed vectors). Such an approach can lead to improvements in final model performance. However, use of pre-clustering techniques has only been shown to work in low dimensional, binary classification tasks. In contrast, some embodiments discussed herein are modified to allow application of pre-clustering techniques to determine seed vectors in high-dimensional multi-class problem spaces. As used herein, the phrase “high-dimensional data” is used in its broadest sense to mean a dataset having a number of dimensions that is so high that the number of features can exceed the number of observations. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize a variety of high dimensional data to which embodiments discussed herein may be applied. It is also noted that embodiments are not limited to application to high dimensional data, but may be applied to datasets that are not high dimensional.

Turning specifically to FIG. 3A, a set of high dimensional data 300 is provided. Again, while the process is described in relation to a high-dimensional dataset, the process may also be applied to non-high dimensional datasets. High-dimensional data 300 includes three instances of a hand-drawn number “1”. High-dimensional data 300 is clustered using Gaussian Mixture Modeling (GMM), using cluster medoids as seeds as is known in the art. The optimal clustering, assessed by both number of clusters and distribution of points within clusters, is determined using the average silhouette approach as is known in the art. Such clustering techniques have performance issues when utilized in high-dimensional datasets due to, for example, higher data sparsity and increased irrelevance of notions of distance.

Some embodiments resolve the data sparsity and increased irrelevance by applying a manifold learning technique, t-Distributed Stochastic Neighbor Embedding (t-SNE), for dimensionality reduction prior to applying the aforementioned clustering. An example of application of t-SNE to high dimensional data 300 is shown in FIG. 3B as a t-SNE reduced dataset 310 having a lower dimensional representation than high dimensional data 300. As shown in the example, application of t-SNE substantially reduces the dimensional representation of the input dataset. After application of the t-SNE, the aforementioned GMM is applied to cluster a t-SNE reduced dataset 310 to yield the clustered dataset 320 of FIG. 3C. In particular, clusters of data 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370 (outlined with circles) are found. Then, as shown in FIG. 3D, a medoid for each of the respective clusters of data 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370 is calculated (i.e., medoids 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390). Each of the medoids (in this example, twelve medoids) is then provided in its original data format as shown as seed vectors 340 of FIG. 3E. In this example, the process provides seed vectors 340 that can be used to identify all three of the hand-drawn instances of the number “1” found in high-dimensional data 300. Further understanding of the above-described seed vector identification approach is set forth in U.S. patent application Ser. No. 17/018,930 entitled “CONVEX OPTIMIZED STOCHASTIC VECTOR SAMPLING BASED REPRESENTATION OF GROUND TRUTH”, and filed by Khanna on Sep. 11, 2020. The entirety of the aforementioned reference is incorporated herein by reference for all purposes.
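The medoid step described above can be sketched as follows. This example assumes that the dimensionality reduction and clustering have already been performed by a library of the practitioner's choosing, so that each vector has a reduced coordinate and a cluster label; the names `medoid_index` and `seed_indices` and the toy coordinates are hypothetical.

```python
import math

def medoid_index(points, member_indices):
    """Index (into points) of the cluster member whose summed Euclidean
    distance to the other members of the same cluster is smallest."""
    best_idx, best_cost = None, math.inf
    for i in member_indices:
        cost = sum(math.dist(points[i], points[j]) for j in member_indices)
        if cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx

def seed_indices(reduced_points, cluster_labels):
    """Map each cluster label to the index of that cluster's medoid."""
    clusters = {}
    for idx, label in enumerate(cluster_labels):
        clusters.setdefault(label, []).append(idx)
    return {label: medoid_index(reduced_points, members)
            for label, members in clusters.items()}

# Hypothetical 2-D embedding (a stand-in for reduced data) with two clusters.
reduced = [(0.0, 0.0), (1.0, 0.0), (0.4, 0.1),
           (10.0, 10.0), (11.0, 10.0), (10.5, 10.2)]
labels = [0, 0, 0, 1, 1, 1]
seeds = seed_indices(reduced, labels)
```

Because the function returns indices rather than reduced coordinates, the seed vectors can be looked up in the original dataset, consistent with providing the medoids in their original data format.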

Vector Ranking Features

Turning to FIG. 4, a flow diagram 400 shows a method for performing multiple process feature calculation on each of the decision output vectors provided from the model to be trained in accordance with some embodiments. Following flow diagram 400, once there are some labeled vectors (e.g., provided in accordance with FIGS. 3A-3E above), a preliminary model is trained on the labeled vectors that aims to classify other unlabeled data. In order to identify the optimal vector to query the oracle about, a number of feature identification heuristics are applied to each unlabeled vector in the problem space. The decision output vectors provided from the model to be trained are shown in FIG. 4 as input vectors.

In particular, Shannon's entropy heuristic may be applied to each of the decision output vectors provided from a model to be trained to yield respective SE features (block 402). Shannon's entropy is a metric that represents the total amount of information stored in a distribution, and is typically thought of as a measure of uncertainty in the field of machine learning. Shannon's entropy may be defined by the following equation:

$\arg\max_{x} -\sum_{i} p(y_{i} \mid x;\Theta)\log p(y_{i} \mid x;\Theta). \quad (9)$

The more uniform a distribution is, the larger the entropy of the distribution. A model with a high confidence or probability score for a particular class will have low entropy, whereas a model that is not confident in deciding between classes will have high entropy, making the metric ideal for modeling uncertainty. The model to be trained may be used to determine the aforementioned Shannon's entropy heuristic.
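A minimal sketch of the entropy heuristic, assuming the model to be trained exposes a class-probability vector for each unlabeled vector, might look as follows. The helper names and the sample probabilities are illustrative only.

```python
import math

def shannon_entropy(probs):
    """Entropy (in nats) of a class-probability vector.
    Zero-probability classes contribute nothing to the sum."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_uncertain(prob_rows):
    """Index of the vector whose predicted distribution has the highest entropy."""
    return max(range(len(prob_rows)),
               key=lambda i: shannon_entropy(prob_rows[i]))

predictions = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, high entropy
    [0.70, 0.20, 0.10],
]
pick = most_uncertain(predictions)
```

The near-uniform row is selected, which matches the observation above that a model unable to decide between classes produces a high-entropy output.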

Additionally, a confidence based heuristic may be applied to each of the decision output vectors using corresponding confidence values provided from the model to be trained to yield respective CB features (block 404). Entropy takes into account uncertainty across all available classes, but a model may have a hard time deciding between just two classes. A margin of confidence (MC) defined by:

1−(p(y* ₍₁₎ |x;Θ)−p(y* ₍₂₎ |x;Θ)), and/or

a ratio of confidence (RC) determined by:

(p(y* ₍₁₎ |x;Θ)/p(y* ₍₂₎ |x;Θ)),

may be determined using the model to be trained. Here, y*_((n)) denotes the n^(th) most likely class based on the model's prediction probabilities. MC is one minus the difference between the two most confident predictions, while RC is their ratio.

An alternative approach is simply choosing the point whose classification the model has the Lowest Confidence (LC) in, as is shown in its formula argmin_(x) p(y₍₁₎|x). Despite its simplicity, LC works well with conditional random fields as well as for active learning in information extraction tasks. Thus, in different embodiments, the CB feature may be a different one of LC, MC, or RC. Such LC feature determination may be performed using the model to be trained.
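The three confidence based measures can be illustrated with a short sketch. It assumes each prediction is a list of class probabilities with at least two classes; the function names are hypothetical, and RC is computed exactly as written above, so it assumes the second-highest probability is non-zero.

```python
def top_two(probs):
    """Largest and second-largest class probabilities."""
    first, second = sorted(probs, reverse=True)[:2]
    return first, second

def margin_of_confidence(probs):
    """MC: one minus the gap between the two most likely classes."""
    first, second = top_two(probs)
    return 1.0 - (first - second)

def ratio_of_confidence(probs):
    """RC: ratio of the most likely to the second most likely class."""
    first, second = top_two(probs)
    return first / second  # assumes second > 0

def least_confidence_pick(prob_rows):
    """LC: index of the vector whose top-class probability is smallest."""
    return min(range(len(prob_rows)), key=lambda i: max(prob_rows[i]))

rows = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
lc_pick = least_confidence_pick(rows)
```

Under these definitions an MC value near one, or an RC value near one, indicates that the model is torn between its two leading classes.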

Additionally, a distance from hyperplane heuristic may be applied to each of the decision output vectors provided from the model to be trained to yield respective DH features (block 406). One potential strategy for labeling points is to choose points we expect to maximally narrow the existing margins. The location of a vector with respect to a decision boundary determines how much its labeling changes the position of the decision boundary, with closer vectors having a greater effect. Different problem spaces will have differing dimensions, and varying separation between classes. In order to utilize metrics across problem spaces, we scale a vector's boundary distance by the average distance for all points in the problem space. The DH features may be determined using a linear support vector machine (SVM), a Sigmoid SVM, a radial basis function (RBF) SVM, or a polynomial SVM.
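For the linear case, the scaling described above can be sketched as follows. The example assumes the decision boundary is already available as a weight vector `w` and offset `b` (for instance from a trained linear SVM); those values, the sample points, and the function names are hypothetical.

```python
import math

def hyperplane_distance(x, w, b):
    """Unsigned Euclidean distance from x to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

def scaled_distances(points, w, b):
    """Each point's boundary distance divided by the mean distance over all
    points, so values are comparable across problem spaces.
    Assumes at least one point lies off the boundary (mean > 0)."""
    raw = [hyperplane_distance(x, w, b) for x in points]
    mean = sum(raw) / len(raw)
    return [d / mean for d in raw]

# Hypothetical boundary x0 = 1 in two dimensions.
w, b = (1.0, 0.0), -1.0
pts = [(1.5, 0.0), (3.0, 2.0), (0.0, -1.0)]
dh = scaled_distances(pts, w, b)
closest = min(range(len(dh)), key=dh.__getitem__)
```

The point with the smallest scaled distance is the one whose labeling would be expected to move the boundary the most.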

Additionally, an orthogonality heuristic may be applied to each of the decision output vectors provided from the model to be trained to yield respective OR features (block 408). When performing active learning in high dimensional problem spaces, it is easy for algorithms to ignore particular dimensions or pockets within a problem space due to the nature of having dimensions that are orders of magnitude larger than the number of examples. This can lead to a major disconnect between the decision boundaries of the model to be trained and the true underlying class separation. By searching for examples that are orthogonal to the space spanned by the set of labeled data, the learner is given information about dimensions that have not yet been explored. In order to utilize these principles even in problem spaces of lower dimensionality or with higher space coverage, this constraint is relaxed to allow for vectors with large angles to be selected. In some embodiments, the orthogonality metric is defined by the following equation:

$\min_{x_{l} \in L} \cos^{-1}\left( \frac{\langle x_{i}, x_{l} \rangle}{|x_{i}|\,|x_{l}|} \right),$

which finds the smallest angle between the unlabeled vector x_(i) in question and the vectors x_(l) in the labeled set L.
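The orthogonality metric and the resulting ordering of unlabeled vectors can be sketched as follows, assuming non-zero vectors represented as tuples of floats. The function names and sample vectors are illustrative assumptions.

```python
import math

def angle_between(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)

def orthogonality(x, labeled):
    """Smallest angle between x and any vector in the labeled set."""
    return min(angle_between(x, l) for l in labeled)

def rank_by_orthogonality(unlabeled, labeled):
    """Indices of unlabeled vectors, largest smallest-angle first."""
    return sorted(range(len(unlabeled)),
                  key=lambda i: orthogonality(unlabeled[i], labeled),
                  reverse=True)

labeled = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
unlabeled = [(0.9, 0.1, 0.0), (0.0, 0.0, 2.0), (1.0, 1.0, 1.0)]
order = rank_by_orthogonality(unlabeled, labeled)
```

In this toy case the vector lying along the unexplored third axis is orthogonal to every labeled vector and is ranked first, while the vector nearly parallel to a labeled vector is ranked last.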

Additionally, an information density heuristic is applied to each of the decision output vectors provided from the model to be trained to yield respective ID features (block 410). Many active learning algorithms aim to query vectors our given model is most uncertain of, leading to a proclivity to query outliers whose labeling will have little to no effect on model performance. This motivating factor led to the development of the information density framework (IDF) defined by:

$\arg\max_{x} \Phi_{A}(x) \times \left( \frac{1}{U}\sum_{u} \mathrm{sim}(x, x^{(u)}) \right)^{\beta},$

where Φ_(A)(x) is the informativeness of x under a base query strategy A, U is the number of instances in the input distribution, and β controls the relative importance of the density term. Manipulating IDF, an information density metric (IDM) can be coined as follows:

$\frac{1}{U}\sum_{u} \mathrm{sim}(x, x^{(u)}).$

IDM aims to scale the strategy by weighing it against the average similarity to all other instances in the input distribution. In the equation, sim refers to a similarity function such as cosine similarity, the dot product between normalized vectors, or Euclidean similarity, which is the reciprocal of Euclidean distance. The higher the information density, the more similar the given instance is to the rest of the data. While Cosine IDM defines the centermost cluster as most important, Euclidean IDM prefers the center of clusters.
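A minimal sketch of the cosine form of the information density metric follows. It assumes the pool over which similarity is averaged includes the vector itself, and it uses non-negative toy features so that the density stays non-negative; the function names, the `beta` default, and the data are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Dot product between the normalized versions of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def information_density(x, pool, sim=cosine_similarity):
    """Average similarity of x to every vector in the pool."""
    return sum(sim(x, other) for other in pool) / len(pool)

def density_weighted_score(base_score, x, pool, beta=1.0):
    """Base informativeness scaled by density raised to beta.
    Assumes the density is non-negative (true for non-negative features)."""
    return base_score * information_density(x, pool) ** beta

pool = [(1.0, 0.0), (0.9, 0.1), (0.8, 0.2), (0.0, 1.0)]  # last is an outlier
densities = [information_density(x, pool) for x in pool]
outlier = min(range(len(pool)), key=densities.__getitem__)
```

The outlier receives the lowest density, so even a high uncertainty score for it is scaled down relative to vectors that resemble the rest of the data.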

Additionally, a perturbation heuristic may be applied to each of the decision output vectors provided from the model to be trained to yield respective PE features (block 412). The usefulness of active learning can be extended for all model types by identifying the maximal shift in model confidence incurred by adding perturbation to each unlabeled vector. Let ϵ ~ 𝒩(0, 1), then calculate:

$D_{KL}\left( p(y \mid x) \,\|\, p(y \mid x + \epsilon) \right).$

In other words, the Kullback-Leibler divergence (D_(KL)) of the model's prediction probabilities is calculated for a given vector before and after adding perturbation. The larger the divergence after adding ϵ, the more crucial a label is to improve model performance. Said another way, the aforementioned perturbation heuristic involves processing a vector to determine a first predicted result that corresponds to the vector, and in addition adding noise to the same vector and processing the noise augmented vector to determine a second predicted result. The first predicted result is then compared with the second predicted result to yield a difference that is attributed to the addition of the noise. As an example, where the first predicted result identifies a different class than the second predicted result and the change is significant, the vector is one that lies at a junction of the classes (perhaps, for example, in a region of interest as described above) and thus represents a vector that is a better candidate for labeling by an oracle than other vectors where either no change in class, or a change in class with only a small difference, is noted. Thus, the addition of noise does not test how strong or robust the model is, but rather flags vectors that are more valuable to training a model. Such PE feature determination may be performed using the model to be trained.
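The perturbation heuristic can be sketched with a toy model as follows. The two-class predictor `toy_predict`, the averaging over several noise draws, and the small `eps` term used to avoid taking the logarithm of zero are all assumptions introduced for this example rather than details from the disclosure.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def perturbation_score(predict, x, rng, trials=20):
    """Mean KL divergence between the prediction on x and the prediction
    on x plus standard normal noise (averaging is an assumption)."""
    base = predict(x)
    total = 0.0
    for _ in range(trials):
        noisy = [xi + rng.gauss(0.0, 1.0) for xi in x]
        total += kl_divergence(base, predict(noisy))
    return total / trials

# Hypothetical two-class model whose boundary is x[0] = 0.
def toy_predict(x):
    return softmax([2.0 * x[0], -2.0 * x[0]])

rng = random.Random(0)
near_boundary = perturbation_score(toy_predict, [0.1, 0.0], rng)
far_from_boundary = perturbation_score(toy_predict, [6.0, 0.0], rng)
```

A vector near the class boundary shows a much larger divergence under noise than one deep inside a class, which is the signal used to flag it as a more valuable labeling candidate.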

Additionally, an expected gradient heuristic may be applied to the input vectors to yield respective EG features (block 414). Discriminative models are typically trained using gradient-based optimization; the amount a model will be changed at a given time can be quantified by the expected gradient length. In order to make the largest updates to the model possible, it will be optimal to choose a vector x that leads to the largest change in our objective function ℓ, as determined via the following equation:

$\arg\max_{x} \sum_{i} p(y_{i} \mid x;\Theta)\, \| \nabla \ell_{U}(x \mid y_{i};\Theta) \|.$

The vector's gradient for a possible class is scaled by its prediction probability as output by the current model. Such EG feature determination may be performed using a Softmax Regression model.
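For a softmax regression model without a bias term, the expected gradient length can be sketched as follows. The example assumes a cross-entropy objective, for which the gradient with respect to the weights for a candidate label is the outer product of (probabilities minus one-hot label) and the input, so its norm factorizes; the weights and inputs shown are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_gradient_length(weights, x):
    """weights: one weight vector per class (no bias, for brevity).
    Returns the probability-weighted norm of the cross-entropy gradient
    over every possible label for x."""
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    probs = softmax(logits)
    x_norm = math.sqrt(sum(v * v for v in x))
    egl = 0.0
    for i, p_i in enumerate(probs):
        # Gradient if the label were class i: outer(probs - onehot(i), x).
        residual = [p - (1.0 if k == i else 0.0) for k, p in enumerate(probs)]
        grad_norm = math.sqrt(sum(r * r for r in residual)) * x_norm
        egl += p_i * grad_norm
    return egl

weights = [(1.0, 0.0), (-1.0, 0.0)]
egl_uncertain = expected_gradient_length(weights, (0.0, 1.0))
egl_confident = expected_gradient_length(weights, (1.0, 0.0))
```

With inputs of equal norm, the vector on which the model is undecided yields the larger expected gradient length and would therefore be preferred for labeling.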

Additionally, a consensus based heuristic may be applied to the input vectors to yield respective CB features (block 416). Such consensus based strategies utilize multiple models in various combinations in order to identify vectors of interest. Query-by-committee consensus has a committee composed of multiple models trained on our set of labeled data with each model having a unique initialization. Co-Training and Co-Learning approach consensus through different lenses, using differing subsets of features and using different model types altogether, respectively. No matter the consensus strategy, they all function in a similar way. The vectors that models disagree the most over have the most potential information to give; these vectors are the most optimal to label. The aforementioned Query-by-committee and Co-Training feature determinations may be performed using the model to be trained, and the aforementioned Co-Learning feature determination may be performed using a Perceptron model, a Random Forest model, or a Softmax regression model. While FIG. 4 is described as using the aforementioned algorithms and strategies to identify features of input vectors, one of ordinary skill in the art will appreciate other algorithms and/or strategies that may be used in addition to or in place of one or more of the algorithms and strategies discussed above.
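One way to quantify committee disagreement is vote entropy, sketched below. The disclosure does not specify a particular disagreement measure, so the use of vote entropy, the function names, and the sample votes are assumptions made only for illustration.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the label votes cast by committee members for one vector."""
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def most_contested(committee_votes):
    """Index of the vector on which the committee disagrees the most."""
    return max(range(len(committee_votes)),
               key=lambda i: vote_entropy(committee_votes[i]))

# Each inner list holds one vote per committee member for a single vector.
votes = [
    ["cat", "cat", "cat"],   # full agreement
    ["cat", "dog", "bird"],  # full disagreement
    ["dog", "dog", "cat"],
]
contested = most_contested(votes)
```

The same scoring applies whether the committee members differ by initialization, by feature subset, or by model type.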

Vector Ranking

Turning to FIG. 5, a flow diagram 500 shows a method in accordance with some embodiments for ranking unlabeled input vectors in the data from the problem space using multiple features determined and/or calculated using different feature generation processes (e.g., the various features generated using the method in flow diagram 400 discussed above in relation to FIG. 4). In some embodiments, the processes of flow diagram 500 may be implemented in a VQNN. Following flow diagram 500, the process is repeated for each unlabeled vector and thus begins before each processing of an unlabeled input vector by determining whether any unlabeled input vectors remain to be processed (block 502). Where one or more unlabeled input vectors remain to be processed (block 502), a step value is initialized to zero (0) and a size value is set equal to a default value (block 504). In some embodiments, the size value may be user programmable. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize a variety of step values and/or size values that may be used in relation to different embodiments.

The next one of the unlabeled input vectors is selected for potential labeling (block 506) and this selected unlabeled input vector is removed from the other unlabeled input vectors and set aside (block 508). Turning to FIG. 6A, an example of a set of labeled input vectors (X_(L) ^((n))) 600 and a set of unlabeled input vectors (X_(U) ^((n))) 650 are shown. One of unlabeled input vectors 650 has been selected (X_(U) ⁽⁵⁾) as indicated by the dashed box 652. Returning to FIG. 5, the unlabeled input vectors remaining after removal of the selected unlabeled input vector are shuffled (i.e., the order of the vectors is changed). Turning to FIG. 6B, selected unlabeled input vector (X_(U) ⁽⁵⁾) 652 has been removed from unlabeled input vectors 650, leaving only unlabeled input vectors 654. Unlabeled input vectors 654 remaining (i.e., originally ordered X_(U) ⁽¹⁾, X_(U) ⁽²⁾, X_(U) ⁽³⁾, X_(U) ⁽⁴⁾, X_(U) ⁽⁶⁾, X_(U) ⁽⁷⁾, X_(U) ⁽⁸⁾, X_(U) ⁽⁹⁾, and X_(U) ⁽¹⁰⁾) after removal of selected unlabeled input vector (X_(U) ⁽⁵⁾) 652 are shuffled to yield an order X_(U) ⁽¹⁰⁾, X_(U) ⁽¹⁾, X_(U) ⁽⁹⁾, X_(U) ⁽⁶⁾, X_(U) ⁽²⁾, X_(U) ⁽⁴⁾, X_(U) ⁽³⁾, X_(U) ⁽⁸⁾, and X_(U) ⁽⁷⁾.

A subset of the unlabeled input vectors remaining after removal of the selected unlabeled input vector is selected using the step value and the size value (block 512). Thus, as an example, where the step value is zero (0) and the size value is four (4), the first four vectors of the remaining unlabeled input vectors are selected. As another example, where the step value is one (1) and the size value is eight (8), the second through the ninth of the remaining unlabeled input vectors are selected. The selected subset of the remaining unlabeled input vectors is merged with the selected unlabeled input vector to yield a union of unlabeled input vectors (block 514). Turning to FIG. 6C, an example for a step value of zero (0) and a size value of five (5) is shown. As shown, the first five vectors (i.e., X_(U) ⁽¹⁰⁾, X_(U) ⁽¹⁾, X_(U) ⁽⁹⁾, X_(U) ⁽⁶⁾, and X_(U) ⁽²⁾) are selected as a subset 658, and subset 658 is joined with selected unlabeled input vector 652 to form a subset of unlabeled input vectors 660.

Returning to FIG. 5, a union of the subset of unlabeled input vectors and the labeled input vectors is formed (block 516). Turning to FIG. 6D, an example of a union 680 of subset of unlabeled input vectors 660 and labeled input vectors 600 is shown. Returning to FIG. 5, a minimum expected performance value and an optimal expected performance value for the union are calculated (block 518). In some embodiments, the expected performance values are calculated in accordance with the following equations:

x_(min) = arg min_(x) E_(future); and

x_(optimal) = arg max_(x) E_(future).

E_(future) is the expected effect of labeling the vector on future performance for other unlabeled vectors. In layman's terms, a sliding window is used to select a group of vectors to label alongside the vector currently under consideration; each time the window slides, the group of vectors to label changes, but the vector under consideration is always part of the set. For each group of vectors selected, each vector in the group is added to the labeled vector set and the total increase in performance is evaluated. After evaluation, the group of vectors is removed from the labeled vector set. After all groupings have been tried, the average model increase for each of the groups that included the considered vector is calculated. This allows for estimation of the performance of the model after labeling the considered vector in the future, after other vectors have been labeled as well. Such a comparison value is the E_(future) of the foregoing equations. In order to convert the E_(future) values to rankings, the intermediate values listed above are calculated to facilitate this transformation. Such a roundabout way to determine the best vectors to label is used as it is helpful to consider how a vector carves up the search space of the unlabeled vector set if it were to be labeled. Subpar vector selection can dramatically hamper how effective the labeling process becomes when future vectors are considered for labeling, rapidly leading to diminishing returns. For this reason, all of the unlabeled vectors are considered rather than simply determining which vector is closest to the expected result. The aforementioned values are stored in relation to the selected unlabeled input vector and the particular union.

It is determined whether another union is possible for the selected unlabeled input vector (block 520). Another union is possible where the step value plus one (1) plus the size value does not extend beyond the end of the unlabeled input vectors remaining after removal of the selected unlabeled input vector. Where another union is possible (block 520), the step value is incremented (block 522) and the processes of blocks 512-520 are repeated for the selected unlabeled input vector using the new step value and the previously set size value. Turning to FIG. 6E, an example is shown where the step value is incremented to one (1) (it was previously zero (0)). As shown, a subset of unlabeled input vectors 664 is created from a combination of selected unlabeled input vector 652 and a subset 662 selected using the step value (i.e., 1) and the size value (i.e., 5).
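The sliding selection of subsets by step value and size value can be sketched as a small generator. The vector names below reuse the shuffled order given in the example of FIG. 6B; the function name `sliding_unions` is hypothetical, and the performance evaluation performed on each union is omitted.

```python
def sliding_unions(selected, remaining, size):
    """Yield (step, union) pairs. Each union joins the selected vector with
    a window of `size` consecutive vectors taken from `remaining`, and the
    window advances by one position per step until it would run past the end."""
    step = 0
    while step + size <= len(remaining):
        yield step, [selected] + remaining[step:step + size]
        step += 1

selected = "XU5"
shuffled = ["XU10", "XU1", "XU9", "XU6", "XU2", "XU4", "XU3", "XU8", "XU7"]
unions = list(sliding_unions(selected, shuffled, size=5))
```

With nine remaining vectors and a size value of five, the loop produces five unions (step values zero through four), and the selected vector appears in every one of them.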

Returning to FIG. 5, where no other unions with the selected unlabeled input vector are possible (block 520), the previously selected unlabeled input vector is returned to the other unlabeled input vectors and it is determined whether any of the unlabeled input vectors remain to be selected and processed (block 502). Where additional unlabeled input vectors remain to be processed (block 502), the processes of blocks 504-522 are repeated for the next of the unlabeled input vectors.

Alternatively, where no unlabeled input vectors remain to be processed (block 502), all of the unlabeled input vectors are ranked using the average of all expected performance values for the multiple unions in which the respective unlabeled input vector was processed (block 524). This includes averaging all of the x_(min) values for the unions in which the respective unlabeled input vector was processed to yield an x_(min,average) value; and averaging all of the x_(optimal) values for the unions in which the respective unlabeled input vector was processed to yield an x_(optimal,average) value. Using these average values, a rank for the respective vector is calculated in accordance with the following equation:

rank_(x)=(x−x _(min,average))/(x _(optimal,average) −x _(min,average)).
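The rank calculation is a min-max style normalization, sketched below. The example assumes that x denotes the expected performance value attributed to the vector being ranked; that interpretation, the function names, and the sample values are assumptions made for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def rank_value(x, x_min_values, x_optimal_values):
    """Normalize x between the averaged minimum and averaged optimal
    expected performance values collected over the vector's unions."""
    x_min_avg = mean(x_min_values)
    x_opt_avg = mean(x_optimal_values)
    if x_opt_avg == x_min_avg:
        raise ValueError("optimal and minimum averages must differ")
    return (x - x_min_avg) / (x_opt_avg - x_min_avg)

# Hypothetical per-union values for one unlabeled vector.
rank = rank_value(0.06, [0.01, 0.03], [0.09, 0.11])
```

Under this reading, a vector whose value equals the averaged minimum receives a rank of zero and a vector whose value equals the averaged optimum receives a rank of one.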

For the most optimal vectors (in this case the highest ranked vectors) extra care is taken to assure that the relative rankings are accurate. This helps to assure that the best vector(s) are ultimately selected for labeling. To this end, in some embodiments the following loss function is applied:

$\frac{1}{N}{\sum\limits_{i}{{\exp( {- \frac{( {y_{i} - 1} )^{2}}{2\tau^{2}}} )}( {y_{i} - {\hat{y}}_{i}} )^{2}}}$

where yi is the true ranking, yihat is the predicted ranking, N is the number of vectors considered, and τ is a hyper parameter that controls how quickly weight falloff occurs.

In some embodiments, training τ relies on synthetic data rather than real world datasets due to the relatively low costs involved in obtaining additional problem spaces to incorporate into our training set. Such an approach allows for creation of larger and more powerful models that otherwise would have suffered from over-fitting, which lessens as the amount of data increases. Such an approach can yield a large set of potential values for τ. To reduce this large set, an optimization approach relying on Tree Parzen Estimation (TPE) can be used. Since TPE tracks previous evaluation results in order to map hyper parameter sets to probabilistic models, it enables τ to be tuned faster and has been empirically shown to lead to better results than alternative approaches to hyper parameter tuning.

Turning to FIG. 7, an example VQNN 700 is shown that may be used to perform the vector ranking processes discussed in relation to FIG. 5. VQNN 700 uses hidden layers having Tanh activation (Tanh Density Connected Network Units 704, and Tanh Double Residual units 706, 708) with later layers (Tanh Density Connected Network Units 710, and Tanh Double Residual units 712, 714) being slightly larger than those in the beginning. Since vector rankings are expressed in non-negative values, a rectified linear unit output 716 is used.

Automated, Adaptive Vector Labeling

Turning to FIG. 8, a flow diagram 800 shows a method for automated, adaptive vector labeling in accordance with various embodiments. Following flow diagram 800, the decision output vectors and confidence outputs from the model to be trained are received after completion of a vector ranking and non-automated labeling process (block 802). The decision output vectors each indicate what the model to be trained believes the corresponding input vector to represent, and the confidence output indicates the degree of confidence the model has that the decision output vector is correct. As one example, the decision output vectors and corresponding confidence outputs may be provided as a result of the model training performed in relation to block 214 of FIG. 2.

Each of the decision output vectors that are both unlabeled and exhibit a confidence greater than a programmable threshold value are selected to yield high confidence, unlabeled vectors (block 804). An auto-annotation classification model is applied to the high confidence, unlabeled vectors to classify the individual vectors for labeling. In some embodiments, the auto-annotation classification model is implemented as a vector pseudo labeling neural network (VPLNN) that operates to predict whether the given vector has been correctly labeled by the model to be trained using the received decision output vector and corresponding confidence value. The auto-annotation classification model provides an output indicating that the particular high confidence, unlabeled vector was validly labeled by the model to be trained, or indicating that the particular high confidence, unlabeled vector was not validly labeled by the model to be trained.

Each of the high confidence, unlabeled vectors processed by the auto-annotation classification model are then processed (block 808). This processing continues until all of the high confidence, unlabeled vectors have been considered. Where another high confidence, unlabeled vector remains to be processed (block 808), it is determined whether the application of the auto-annotation classification model found the label applied by the model to be trained was valid (i.e., correct) (block 810). Where the application of the auto-annotation classification model did not find the label applied by the model to be trained valid (block 810), the next high confidence, unlabeled vector is selected for processing (block 808).

Alternatively, where the label was found valid (block 810), the particular high confidence, unlabeled vector is compared with other labeled vectors that have the same label to determine whether the particular high confidence, unlabeled vector is similar to at least one other previously labeled vector (block 812). This similarity comparison is performed to ensure that the vector satisfies the smoothness constraint, where vectors of the same class are closer in distance to each other than they are to vectors of a differing class. Enforcement of this constraint can be performed using a variety of distance measurements, such as Euclidean distance, Manhattan distance, or Mahalanobis distance. For example, if our target model labels a vector with high confidence as a dog and our VAANN identifies the vector as being correctly classified, yet it is closest to a vector corresponding to a cat, we will not annotate the vector. However, if the same vector was indeed closest to another dog vector, then we can annotate this vector as a dog with certainty. By assuring that the particular high confidence, unlabeled vector is similar to at least one other previously labeled vector, any labeling that is ultimately applied will not be to vectors that are novel. While such novel vectors may have been accurately predicted for labeling, the labeling of novel vectors is reserved for the oracle to reduce the possibility of introducing mis-labeled vectors in the automated labeling process, and the damage that such mis-labeled vectors cause to the model to be trained.

Where the particular high confidence, unlabeled vector is similar to at least one other previously labeled vector (block 812), the predicted label is added to the particular high confidence, unlabeled vector and the newly labeled vector is added to the growing list of labeled vectors (block 814). Our target model is then retrained on the modified set of labeled vectors. Either where the particular high confidence, unlabeled vector is not similar to at least one other previously labeled vector (block 812) or labeling of the vector has been done (block 814), the next high confidence, unlabeled vector is selected for processing (block 808).
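The three gates applied to each candidate (confidence threshold, validation by the auto-annotation model, and similarity to a same-label neighbor) can be sketched as follows. The `validator` callable is a stand-in for the auto-annotation classification model, Euclidean distance is chosen from the options listed above, retraining is omitted, and all names and sample data are hypothetical.

```python
import math

def nearest_label(x, labeled):
    """Label of the labeled vector closest to x (Euclidean distance).
    labeled: list of (vector, label) pairs."""
    return min(labeled, key=lambda pair: math.dist(x, pair[0]))[1]

def auto_label(candidates, labeled, threshold, validator):
    """candidates: list of (vector, predicted_label, confidence).
    Returns a new labeled list extended with every candidate that passes
    all three gates; the input list is left unchanged."""
    labeled = list(labeled)
    for vec, pred, conf in candidates:
        if conf <= threshold:
            continue  # not a high confidence vector
        if not validator(vec, pred, conf):
            continue  # auto-annotation model rejected the label
        if nearest_label(vec, labeled) != pred:
            continue  # closest labeled vector has a different label
        labeled.append((vec, pred))
    return labeled

labeled = [((0.0, 0.0), "cat"), ((5.0, 5.0), "dog")]
candidates = [
    ((4.8, 5.1), "dog", 0.97),  # passes every gate
    ((0.2, 0.1), "dog", 0.99),  # closest labeled vector is a cat
    ((5.2, 4.9), "dog", 0.60),  # below the confidence threshold
]
augmented = auto_label(candidates, labeled, threshold=0.9,
                       validator=lambda vec, pred, conf: True)
```

Only the first candidate is added in this toy run; the second is left for the oracle because its nearest labeled neighbor carries a different label.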

Once no other high confidence, unlabeled vectors remain for processing (block 808), the model to be trained is again trained using the augmented labeled dataset along with other unlabeled vectors in the data from the problem space (block 816). The confidence value outputs from the model to be trained are queried to determine whether the process resulted in any decision output vectors with a confidence that exceeds the programmable user threshold (block 818). Where additional decision output vectors were found with a confidence that exceeds the programmable user threshold (block 818), the process of automated, adaptive labeling is repeated. Otherwise, the process of automated, adaptive labeling is terminated and processing is returned to vector ranking and non-automated vector labeling (e.g., block 212 of FIG. 2).

Turning to FIG. 9, an example Vector Pseudo labeling Neural Network (VPLNN) 900 is shown that may be used to perform the vector labeling processes discussed in relation to FIG. 8. VPLNN 900 is a ReLU focused architecture using a series of ReLU Density Residual Units 702, 704, 706, 708, 710, 712, 714, 716, 718, 720 and a Sigmoid Unit Output 722.

Turning to FIG. 10, a dense residual unit (DRU) 1010 is shown that may be used to implement the aforementioned VQNN and VPLNN systems in relation to various embodiments discussed herein. It is noted that while DRU 1010 is shown with a ReLU activation function, other activation functions are possible in accordance with other embodiments. Such activation functions may include, but are not limited to, Tanh or Sigmoid activation functions. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize a variety of activation functions that may be used in DRU 1010 in accordance with different embodiments. The inputs to the VPLNN and the VQNN are the same. What is different, however, is the use of the output of the VPLNN and the output of the VQNN.
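
As a purely illustrative sketch of how the activation function of a residual unit may be swapped, the following Python fragment assumes that a dense residual unit adds the activated output of a single dense layer back to its input. That structure, and the names `ACTIVATIONS` and `dense_residual_unit`, are assumptions made for this example only; the internal layout of DRU 1010 is defined by FIG. 10 and is not reproduced here.

```python
import math

# Candidate activation functions named in the description.
ACTIVATIONS = {
    "relu": lambda z: max(0.0, z),
    "tanh": math.tanh,
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
}

def dense_residual_unit(x, weights, bias, activation="relu"):
    """Return x + act(W x + b) for a square weight matrix (assumed structure)."""
    act = ACTIVATIONS[activation]
    dense = [act(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, bias)]
    # Skip connection: add the unit's input back to the dense output.
    return [xi + di for xi, di in zip(x, dense)]
```

Selecting `activation="tanh"` or `activation="sigmoid"` changes only the nonlinearity; the skip connection is unchanged.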

Turning to FIG. 11, a flow diagram 1100 shows a method in accordance with some embodiments for using perturbation to identify high value labeling targets. Following flow diagram 1100, it is determined whether another data element in a set of data elements remains to be processed (block 1102). The processes of flow diagram 1100 are repeated for each element within a set of data elements in an effort to identify any data elements that would likely yield value to a model if they were labeled (i.e., high value labeling targets). The first or next data element in the set of data elements is selected for processing (block 1104). The first time the processes of flow diagram 1100 are applied, any data element (i.e., a first data element) from the set of data elements is selected for processing. During subsequent times the processes of flow diagram 1100 are applied, any previously unprocessed data element (i.e., a next data element) from the set of data elements is selected for processing.

A mathematical model is applied to the original set of data elements including the selected data element to yield a corresponding set of predictive outputs (block 1106). One of the set of predictive outputs (i.e., a first predictive output) corresponds to the selected data element. A perturbation is added to the selected data element to yield a perturbed data element that corresponds to the selected data element (block 1108).

The same mathematical model is applied to the original set of data elements modified to replace the selected data element with the perturbed data element (block 1110). Application of the mathematical model yields a perturbed set of predictive outputs that includes a perturbed predictive output corresponding to the perturbed data element.

It is determined whether the first predictive output indicates a class that is different from a class indicated by the perturbed predictive output (block 1112). Where adding the perturbation to the selected data element causes the mathematical model to predict a different class, then the perturbation made a significant difference to the mathematical model. As such, the selected data element is identified as a high value labeling target (block 1114). Otherwise, the selected data element is identified as a low value labeling target (block 1116). The processes of blocks 1104-1116 are repeated for each data element in the set of data elements, such that each data element is identified as either a high value labeling target or a low value labeling target. This identification information is used in relation to the labeling processes discussed above in relation to FIGS. 4-9.
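
The class-change test of blocks 1106-1116 may be illustrated with the following minimal Python sketch. It assumes, for simplicity, a model that classifies each data element independently through a hypothetical `predict_class` callable; the names `gaussian_perturb`, `is_high_value`, and `partition_targets`, and the default noise scale, are likewise assumptions of this example rather than elements of FIG. 11.

```python
import random

def gaussian_perturb(x, scale=0.1, rng=random):
    """Add small zero-mean Gaussian noise to each component (assumed perturbation)."""
    return [v + rng.gauss(0.0, scale) for v in x]

def is_high_value(predict_class, x, perturb=gaussian_perturb):
    """True when the perturbation changes the predicted class (cf. block 1112)."""
    first_output = predict_class(x)                 # cf. block 1106
    perturbed_output = predict_class(perturb(x))    # cf. blocks 1108-1110
    return first_output != perturbed_output

def partition_targets(predict_class, elements, perturb=gaussian_perturb):
    """Split data elements into high value and low value labeling targets."""
    high, low = [], []
    for x in elements:                              # cf. blocks 1102-1104
        if is_high_value(predict_class, x, perturb):
            high.append(x)                          # cf. block 1114
        else:
            low.append(x)                           # cf. block 1116
    return high, low
```

Under this sketch, elements lying close to a decision boundary are the ones most likely to change class when perturbed, and are therefore the ones reported as high value labeling targets.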

Turning to FIG. 12, a flow diagram 1200 shows another method in accordance with some embodiments for using perturbation to identify high value labeling targets. Following flow diagram 1200, it is determined whether another data element in a set of data elements remains to be processed (block 1202). The processes of flow diagram 1200 are repeated for each element within a set of data elements in an effort to identify any data elements that would likely yield value to a model if they were labeled (i.e., high value labeling targets). The first or next data element in the set of data elements is selected for processing (block 1204). The first time the processes of flow diagram 1200 are applied, any data element (i.e., a first data element) from the set of data elements is selected for processing. During subsequent times the processes of flow diagram 1200 are applied, any previously unprocessed data element (i.e., a next data element) from the set of data elements is selected for processing.

A mathematical model is applied to the original set of data elements including the selected data element to yield a corresponding set of predictive outputs (block 1206). One of the set of predictive outputs (i.e., a first predictive output) corresponds to the selected data element. A perturbation is added to the selected data element to yield a perturbed data element that corresponds to the selected data element (block 1208).

The same mathematical model is applied to the original set of data elements modified to replace the selected data element with the perturbed data element (block 1210). Application of the mathematical model yields a perturbed set of predictive outputs that includes a perturbed predictive output corresponding to the perturbed data element.

A first divergence corresponding to the first predictive output and a second divergence corresponding to the perturbed predictive output are calculated (block 1212). Each of the aforementioned divergence values is calculated in accordance with the following equation:

$$D_{KL}\big(p(y \mid x) \,\|\, p(y \mid x + \epsilon)\big),$$

where $\epsilon \sim \mathcal{N}(0, 1)$. Then, a difference between the first divergence and the second divergence is calculated to yield a divergence difference (block 1214). This divergence difference is an indication of how significant of a change the addition of perturbation to the selected data element yielded in the output of the mathematical model. Data elements that when perturbed yield the most significant divergence difference are good candidates for labeling. In contrast, data elements that when perturbed yield only lesser changes in the output of the mathematical model are less important when being considered for labeling.

The magnitude of the divergence difference is compared against a threshold value (block 1216). In some cases, the threshold value is user programmable. Where the magnitude of the divergence difference exceeds the threshold value (block 1216), the selected data element is identified as a high value labeling target (block 1218). Otherwise, the selected data element is identified as a low value labeling target (block 1220). The processes of blocks 1204-1220 are repeated for each data element in the set of data elements, such that each data element is identified as either a high value labeling target or a low value labeling target. This identification information is used in relation to the labeling processes discussed above in relation to FIGS. 4-9.
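
As a hedged illustration of the divergence-based test, the following Python sketch scores a data element by the Kullback-Leibler divergence between the model's class distribution for the original element and for its perturbed counterpart, and compares the magnitude of that score with a threshold. This is a simplification: it uses the single divergence between the two outputs as the score rather than a difference of two separately computed divergences. The callable `predict_proba` and the names `kl_divergence`, `gaussian_perturb`, `divergence_score`, and `is_high_value_target` are hypothetical.

```python
import math
import random

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def gaussian_perturb(x):
    """Add independent standard normal noise to each component."""
    return [v + random.gauss(0.0, 1.0) for v in x]

def divergence_score(predict_proba, x, perturb=gaussian_perturb):
    """Divergence between the outputs for x and for its perturbed copy."""
    return kl_divergence(predict_proba(x), predict_proba(perturb(x)))

def is_high_value_target(predict_proba, x, threshold, perturb=gaussian_perturb):
    """True when the magnitude of the score exceeds the (user programmable) threshold."""
    return abs(divergence_score(predict_proba, x, perturb)) > threshold
```

Because the default perturbation is random, repeated calls on the same element may give different scores; averaging over several draws is one possible refinement that this sketch does not attempt.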

Turning to FIG. 13, a flow diagram 1300 shows a method in accordance with some embodiments for using an orthogonality heuristic to identify ignored labeling targets. Following flow diagram 1300, a set of angle values for a selected unlabeled data vector is initialized as null (block 1302). This set of angle values is used in the process of flow diagram 1300 to hold all of the angle values calculated between a selected unlabeled data vector and each of the labeled data vectors included in a set of data vectors that are being processed.

It is determined whether another unlabeled data vector remains for processing in a set of data vectors that includes both labeled data vectors and unlabeled data vectors (block 1304). The processes of flow diagram 1300 are repeated for each unlabeled data vector within the set of data vectors in an effort to identify any data vectors that are likely to be ignored and may yield value to a model if they were labeled (i.e., ignored labeling targets). The first or next unlabeled data vector in the set of data vectors is selected for processing (block 1306). The first time the processes of flow diagram 1300 are applied, any unlabeled data vector (i.e., a first unlabeled data vector) from the set of data vectors is selected for processing. During subsequent times the processes of flow diagram 1300 are applied, any previously unprocessed, unlabeled data vector (i.e., a next unlabeled data vector) from the set of data vectors is selected for processing.

It is determined whether another labeled data vector remains for processing in the set of data vectors (block 1308). The process of flow diagram 1300 considers all labeled vectors in relation to the selected unlabeled data vector (i.e., the unlabeled data vector selected in block 1306). Where another labeled data vector remains for consideration (block 1308), the first or next labeled data vector in the set of data vectors is selected for processing (block 1310). The first time the processes of blocks 1308-1314 are applied, any labeled data vector (i.e., a first labeled data vector) from the set of data vectors is selected for processing. During subsequent times, any previously unconsidered, labeled data vector (i.e., a next labeled data vector) from the set of data vectors is selected for processing.

An angle between the selected unlabeled data vector and the selected labeled data vector is calculated to yield an angle value (block 1312). This angle value may be calculated using any approach known in the art for calculating an angle between two vectors. This calculated angle value is included in the set of angle values for the selected unlabeled vector (block 1314). Again the processes of blocks 1308-1314 are repeated for the selected unlabeled data vector and each of the labeled data vectors in the set of data vectors.

Once an angle value between the selected unlabeled vector and each of the labeled data vectors in the set of data vectors has been calculated and included in the set of angle values for the selected unlabeled vector (block 1308), a minimum angle within the set of angle values is identified (block 1316). This minimum angle is the minimum angle between the selected unlabeled data vector and any labeled data vector within the set of data vectors. This minimum angle is compared with a threshold value (block 1318). Where the minimum angle is greater than the threshold value (block 1318), the selected unlabeled data vector is identified as an ignored labeling target (block 1320). Otherwise, the selected unlabeled data vector is identified as a non-ignored labeling target (block 1322).
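
The processes of flow diagram 1300 may be illustrated by the following minimal Python sketch. It computes angles from the normalized dot product, which is only one of the known approaches for calculating an angle between two vectors, and it ranks the identified vectors so that those with the largest minimum angle come first. The function names `angle_between`, `min_angle_to_labeled`, and `find_ignored_targets`, and the use of degrees for the threshold, are assumptions of this example.

```python
import math

def angle_between(u, v):
    """Angle in degrees between two non-zero vectors (cf. block 1312)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating point round-off.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

def min_angle_to_labeled(unlabeled, labeled_vectors):
    """Smallest angle between one unlabeled vector and any labeled vector."""
    angles = []                                     # cf. block 1302
    for labeled in labeled_vectors:                 # cf. blocks 1308-1310
        angles.append(angle_between(unlabeled, labeled))  # cf. blocks 1312-1314
    return min(angles)                              # cf. block 1316

def find_ignored_targets(unlabeled_vectors, labeled_vectors, threshold_deg):
    """Return (min_angle, vector) pairs above the threshold, largest angle first."""
    ignored = []
    for u in unlabeled_vectors:                     # cf. blocks 1304-1306
        angle = min_angle_to_labeled(u, labeled_vectors)
        if angle > threshold_deg:                   # cf. block 1318
            ignored.append((angle, u))              # cf. block 1320
    ignored.sort(key=lambda pair: pair[0], reverse=True)
    return ignored
```

The sketch evaluates every unlabeled vector against every labeled vector, so its cost grows with the product of the two set sizes.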

While embodiments of the present disclosure have been illustrated and described, numerous modifications, changes, variations, substitutions, and equivalents will be apparent to those skilled in the art. Thus, it will be appreciated by those of ordinary skill in the art that the diagrams, schematics, illustrations, and the like represent conceptual views or processes illustrating systems and methods embodying various non-limiting examples of embodiments of the present disclosure. The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing associated software. Similarly, any switches shown in the figures are conceptual only. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the entity implementing the particular embodiment. Those of ordinary skill in the art further understand that the exemplary hardware, software, processes, methods, and/or operating systems described herein are for illustrative purposes and, thus, are not intended to be limited to any particular named. While the foregoing describes various embodiments of the disclosure, other and further embodiments may be devised without departing from the basic scope thereof.

What is claimed is:
1. A method for identifying an ignored labeling target, the method comprising: receiving, by a processing resource, a set of vectors including at least an unlabelled vector, a first labelled vector, and a second labelled vector; calculating, by the processing resource, a first angle between the unlabelled vector and the first labelled vector, and a second angle between the unlabelled vector and the second labelled vector; and using, by the processing resource, a combination of the first angle and the second angle to determine a labeling value of the unlabelled vector.
2. The method of claim 1, wherein using the combination of the first angle and the second angle to determine a labeling value of the unlabelled vector includes: determining, by the processing resource, that the first angle is less than the second angle; and identifying, by the processing resource, the first angle as a minimum angle based at least in part on determining that the first angle is less than the second angle.
3. The method of claim 2, wherein using the combination of the first angle and the second angle to determine a labeling value of the unlabelled vector further includes: comparing, by the processing resource, the minimum angle with a threshold value.
4. The method of claim 3, wherein using the combination of the first angle and the second angle to determine a labeling value of the unlabelled vector further includes: identifying, by the processing resource, the unlabelled vector as a high value labeling target where the minimum angle exceeds the threshold value.
5. The method of claim 3, wherein the threshold value is user programmable.
6. The method of claim 1, the method further comprising: using, by the processing resource, the labeling value of the unlabelled vector along with the result of at least one other heuristic to rank the unlabelled vector relative to other unlabelled vectors.
7. The method of claim 6, wherein the at least one other heuristic is selected from a group consisting of: a Shannon's entropy heuristic, a confidence based heuristic, a distance from decision hyperplane heuristic, an information density heuristic, a perturbation heuristic, an expected gradient length heuristic, and a consensus based heuristic.
8. A system for identifying an ignored labeling target, the system comprising: a processing resource; a non-transitory computer-readable medium, coupled to the processing resource, having stored therein instructions that when executed by the processing resource cause the processing resource to: receive a set of vectors including at least an unlabelled vector, a first labelled vector, and a second labelled vector; calculate a first angle between the unlabelled vector and the first labelled vector, and a second angle between the unlabelled vector and the second labelled vector; and use a combination of the first angle and the second angle to determine a labeling value of the unlabelled vector.
9. The system of claim 8, wherein the instructions that when executed by the processing resource cause the processing resource to use the combination of the first angle and the second angle to determine the labeling value of the unlabelled vector include instructions that cause the processing resource to: determine that the first angle is less than the second angle; and identify the first angle as a minimum angle based at least in part on determining that the first angle is less than the second angle.
10. The system of claim 9, wherein the instructions that when executed by the processing resource cause the processing resource to use the combination of the first angle and the second angle to determine a labeling value of the unlabelled vector further include instructions that cause the processing resource to: compare the minimum angle with a threshold value.
11. The system of claim 10, wherein the instructions that when executed by the processing resource cause the processing resource to use the combination of the first angle and the second angle to determine a labeling value of the unlabelled vector further include instructions that cause the processing resource to: identify the unlabelled vector as a high value labeling target where the minimum angle exceeds the threshold value.
12. The system of claim 10, wherein the threshold value is user programmable.
13. The system of claim 8, wherein the instructions that when executed by the processing resource further cause the processing resource to: use the labeling value of the unlabelled vector along with the result of at least one other heuristic to rank the unlabelled vector relative to other unlabelled vectors.
14. The system of claim 13, wherein the at least one other heuristic is selected from a group consisting of: a Shannon's entropy heuristic, a confidence based heuristic, a distance from decision hyperplane heuristic, an information density heuristic, a perturbation heuristic, an expected gradient length heuristic, and a consensus based heuristic.
15. A non-transitory computer-readable storage medium embodying a set of instructions, which when executed by one or more processing resources of a computer system, causes the one or more processing resources to: receive a set of vectors including at least an unlabelled vector, a first labelled vector, and a second labelled vector; calculate a first angle between the unlabelled vector and the first labelled vector, and a second angle between the unlabelled vector and the second labelled vector; and use a combination of the first angle and the second angle to determine a labeling value of the unlabelled vector.
16. The non-transitory computer-readable storage medium of claim 15, wherein the instructions that when executed by the one or more processing resources of the computer system cause the one or more processing resources to use the combination of the first angle and the second angle to determine the labeling value of the unlabelled vector include instructions that cause the processing resource to: determine that the first angle is less than the second angle; and identify the first angle as a minimum angle based at least in part on determining that the first angle is less than the second angle.
17. The non-transitory computer-readable storage medium of claim 16, wherein the instructions that when executed by the one or more processing resources of the computer system cause the one or more processing resources to use the combination of the first angle and the second angle to determine a labeling value of the unlabelled vector further include instructions that cause the processing resource to: compare the minimum angle with a threshold value.
18. The non-transitory computer-readable storage medium of claim 17, wherein the instructions that when executed by the one or more processing resources of the computer system cause the one or more processing resources to use the combination of the first angle and the second angle to determine a labeling value of the unlabelled vector further include instructions that cause the processing resource to: identify the unlabelled vector as a high value labeling target where the minimum angle exceeds the threshold value.
19. The non-transitory computer-readable storage medium of claim 17, wherein the threshold value is user programmable.
20. The non-transitory computer-readable storage medium of claim 15, wherein the instructions that when executed by the one or more processing resources of the computer system further cause the one or more processing resources to: use the labeling value of the unlabelled vector along with the result of at least one other heuristic to rank the unlabelled vector relative to other unlabelled vectors.
21. The non-transitory computer-readable storage medium of claim 20, wherein the at least one other heuristic is selected from a group consisting of: a Shannon's entropy heuristic, a confidence based heuristic, a distance from decision hyperplane heuristic, an information density heuristic, a perturbation heuristic, an expected gradient length heuristic, and a consensus based heuristic.